Abstract

Conventional noncooperative game theory hypothesizes that the joint 
(mixed) strategy of a set of reasoning players in a game will necessarily 
satisfy an "equilibrium concept". The set of joint strategies satisfying
that equilibrium concept has measure zero, and all other joint strategies 
are considered impossible. Under this hypothesis the only issue is what 
equilibrium concept is "correct". 

This hypothesis violates the first-principles arguments underlying prob- 
ability theory. Indeed, probability theory renders moot the controversy 
over what equilibrium concept is correct — while in general there are joint 
(mixed) strategies with zero probability, in general the set {strategies with 
non-zero probability} has measure greater than zero. Rather than a first- 
principles derivation of an equilibrium concept, game theory requires a 
first-principles derivation of a distribution over joint strategies. 

However, say one wishes to predict a single joint strategy from that
distribution. Then decision theory tells us to first specify a loss function,
a function which concerns how we, the analyst/scientist external to the 
game, will use that prediction. We then predict that the game will result 
in the joint strategy that is Bayes-optimal for that loss function and dis- 
tribution over joint strategies. Different loss functions — different uses 
of the prediction — give different such optimal predictions. There is no 
more role for an "equilibrium concept" that is independent of the distribu- 
tion and choice of loss function. This application of probability theory to 
games, not just within games, is called Predictive Game Theory (PGT). 

This paper shows how information theory provides a first-principles 
argument for how to set a distribution over joint strategies. The con- 
nection of this distribution to the bounded rational Quantal Response 
Equilibrium (QRE) is elaborated. In particular, taking the QRE to be 
an approximation to the mode of the distribution, correction terms to the 
QRE are derived. In addition, some Nash equilibria are not approached 
by any limiting sequence of increasingly rational QRE joint strategies. 
However it is shown here that every Nash equilibrium is approached with 
a limiting sequence of joint strategies all of which have non-zero probabil- 
ity. (In general though not all strategies in those sequences are modes of 
the associated distributions over joint strategies.) 

It is also shown that in many games, having a probability distribution
with support restricted to Nash equilibria — as stipulated by conven- 
tional game theory — is impossible. So the external analyst should never 
predict a Nash outcome for such games. PGT is also used to derive an 
information-theoretic (and model-independent) quantification of the de- 
gree of rationality inherent in a player's behavior. This quantification 
arises from the close formal relationship between game theory and sta- 
tistical physics. That close relationship is also leveraged to extend game 
theory to situations with stochastically varying numbers of players. This 
extension can be viewed as providing corrections to the replicator dynam- 
ics of conventional evolutionary game theory. 

1 Introduction 

Consider any scientific scenario in which one wishes to predict some
characteristic of interest $y$ concerning some physical system. To make the
prediction one starts with some information/data/prior knowledge $\mathcal{I}$
concerning the system, together with known scientific laws. One then uses
probabilistic inference to transform $\mathcal{I}$ into the desired prediction.
In particular, in Bayesian inference we produce a posterior probability
distribution $P(y \mid \mathcal{I})$.

Such a distribution is a far more informative structure than a single "best
prediction". However, if we wish to synopsize the distribution, we can distill
it into a single prediction. One way to do that is to use the mode of the
posterior as the prediction. This is called the Maximum A Posteriori (MAP)
prediction. Alternatively, say we are given a real-valued loss function
$L(y, y')$ that quantifies the penalty we incur when we predict $y'$ and the
true value is $y$. The Bayes-optimal prediction is then the value of $y'$ that
minimizes the posterior expected loss, $\int dy\, L(y, y')\, P(y \mid \mathcal{I})$.
As an example, say that $y \in \mathbb{R}$ and $L(y, y') = (y - y')^2$. Then the
Bayes-optimal prediction is the posterior expected value of $y$,
$\int dy\, y\, P(y \mid \mathcal{I})$. Formally, to predict any value other than
the Bayes-optimal prediction violates Cox's and Savage's axioms concerning the
need to use probability theory when doing science (see |2 E] and Sec. 2.2
below).
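
The dependence of the Bayes-optimal prediction on the loss function can be illustrated with a small numerical sketch. The discrete posterior below is purely hypothetical, and the candidate predictions are restricted to the same grid of values as $y$ (an assumption made only to keep the example finite). Under quadratic loss the prediction lands on the grid point nearest the posterior mean, while under zero-one loss it lands on the posterior mode, i.e., the MAP prediction.

```python
def bayes_optimal(values, posterior, loss):
    """Return the candidate y' (taken from `values`) that minimizes
    the posterior expected loss sum_y L(y, y') P(y | I)."""
    def expected_loss(y_pred):
        return sum(p * loss(y, y_pred) for y, p in zip(values, posterior))
    return min(values, key=expected_loss)


# Hypothetical discrete posterior P(y | I) over five possible values of y.
values = [0.0, 1.0, 2.0, 3.0, 4.0]
posterior = [0.4, 0.1, 0.1, 0.1, 0.3]


def quadratic_loss(y, y_pred):
    return (y - y_pred) ** 2


def zero_one_loss(y, y_pred):
    return 0.0 if y == y_pred else 1.0


posterior_mean = sum(y * p for y, p in zip(values, posterior))
pred_quadratic = bayes_optimal(values, posterior, quadratic_loss)
pred_zero_one = bayes_optimal(values, posterior, zero_one_loss)

print("posterior mean:", posterior_mean)
print("Bayes-optimal under quadratic loss:", pred_quadratic)
print("Bayes-optimal under zero-one loss (MAP):", pred_zero_one)
```

The same posterior therefore yields two different point predictions, depending only on how the prediction will be scored.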

As a technical comment, in practice evaluating the Bayes-optimal prediction 
may be computationally difficult. In addition doing so often requires that the 
practitioner specify "prior probabilities", which when done poorly can lead to 
bad results. Finally, some non-Bayesian axiomatizations of inference have been 
offered g]. 

For these reasons, when $\mathcal{I}$ consists of experimental data, in
practice non-Bayesian techniques (e.g., Fisherian or Waldian minimax) are often
used instead of pure Bayesian techniques. For example, one might use an unbiased
estimate of the data's mean and an associated confidence interval rather than a
Bayes-optimal predicted mean and associated posterior. More sophisticated
non-Bayesian techniques might use the bootstrap [S] procedure, stacking,
etc.$^1$

$^1$ Note though that often such techniques can be cast as approximations to
Bayesian techniques [1, 2, 3].

The general relationship between such non-Bayesian techniques and purely
Bayesian techniques is exceedingly subtle [7]. However, in the numeric sciences,
even with non-Bayesian techniques, the broad approach to analyzing experimental
data is to use $\mathcal{I}$ (the experimental data) to generate a probability
distribution over the quantity of interest and (if desired) generate an
associated single prediction. (This is why numeric data is presented with
"error bars".) No one has ever suggested why this broad approach would be
appropriate when one is analyzing a physical system without humans and
$\mathcal{I}$ is experimental data, but not appropriate when one is analyzing a
physical system of a set of human players engaged in a game and $\mathcal{I}$ is
the game structure.

Indeed, the Bayesian approach can be motivated in purely game-theoretic
terms. Say we have a game $\gamma$ of some sort. Let $\Delta_X$ indicate the set
of all possible joint mixed strategies in $\gamma$. Now consider a "meta-game"
$\Gamma$ that consists of a scientist ($S$) playing against Nature ($N$). In
this meta-game $N$'s space of possible moves is $\Delta_X$, i.e., the set of all
possible joint mixed strategies in $\gamma$. The move of the scientist $S$ is a
prediction of what element of $\Delta_X$ player $N$ will adopt. As usual in
games against Nature, $N$ has no utility function. However, $S$ has a utility
function, given by the negative of a loss function that quantifies how accurate
her move is as a prediction of $N$'s move. So to maximize her expected utility
the scientist wants to choose her move(s) (her prediction of the joint mixed
strategy that governs the game $\gamma$) to minimize her expected loss. To do
that she needs $N$'s mixed strategy $P(q \in \Delta_X)$.

Now $S$ only has partial information about $N$ (e.g., she only knows the
utility functions of the players in $\gamma$ and their move spaces). Therefore
her first task is to translate that information, $\mathcal{I}$, into a
distribution over $N$'s possible moves (i.e., over the possible mixed strategies
of the game $\gamma$). Intuitively, she must translate her partial information
into an inference of $N$'s mixed strategy. How to do this is the crux of PGT.
Having done this, just as in conventional noncooperative game theory, the
scientist $S$ chooses her move to minimize expected loss (maximize expected
utility) in $\Gamma$, i.e., she predicts that the joint mixed strategy of the
game $\gamma$ is the one that, under expectation, is as close to the actual one
as possible. That is the scientist's assessment of the game's "equilibrium".

Note that for the same $\gamma$ and the same inference by $S$ of the mixed
strategy $P(q)$, if we change $S$'s loss function, we change her prediction.
This is a game-theoretic example of how changing the loss function of the
scientist external to the game will in general change the associated equilibrium
concept mapping $\gamma$ to $S$'s prediction for $\gamma$'s mixed strategy.

The broad approach of converting $\mathcal{I}$ to a distribution and then (if
one has a loss function) converting that distribution to a final prediction is
the one that will be adopted in this paper. This paper is about how to use this
approach to analyze games. In other words, it is about how to infer the mixed
strategy $P(q)$ of a Nature whose moves are joint mixed strategies of a game
$\gamma$.

As a particular example of the implications of this approach, suppose that
our prior information $\mathcal{I}$ concerning the game $\gamma$ does not
explicitly tell us that the players in $\gamma$ are all fully rational. Then in
general, probabilistic inference will produce a non-delta function distribution
over the "rationalities" of the players (however that term is defined). In this
way, applying probabilistic inference to games intrinsically results in bounded
rationality.

To be more concrete, note that mixed strategies in a non-cooperative game 
are themselves probability distributions. Therefore probabilistic inference con- 
cerning mixed strategies involves probability density functions over probability 
distributions. Now in Shannon's information theory [SIEIE^SI the fundamental 
physical objects under consideration are probability distributions, in the form 
of stochastic communications channels. Accordingly, probabilistic inference in 
information theory also involves probabilities of probabilities. This makes the 
mathematical tools for probabilistic analysis in information theory contexts — 
a topic already well-researched — well-suited to a probabilistic analysis of non- 
cooperative games. 

More precisely, a central concept in information theory is a measure of the 
amount of information embodied in a probability distribution p, known as the 
Shannon entropy of that distribution, S(p). Amongst its many other uses, 
Shannon entropy can be used to formalize Occam's razor based on first-principles
arguments. This formalization is known as the minimum information (Maxent) 
principle, and its Bayesian formulation is embodied in what is known as the 
entropic prior. This can serve as the foundation of a first-principles formalism 
for probabilistic inference over probability distributions. 
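
As a concrete reference point, the Shannon entropy of a finite distribution can be computed in a few lines. This is a minimal sketch that assumes natural logarithms (so the result is in nats) and the usual convention $0 \ln 0 = 0$; the three example distributions are hypothetical.

```python
import math


def shannon_entropy(p):
    """S(p) = -sum_x p(x) ln p(x), using the convention 0 ln 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximally spread over four outcomes
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly certain of the first outcome
delta = [1.0, 0.0, 0.0, 0.0]        # fully certain

for name, p in [("uniform", uniform), ("peaked", peaked), ("delta", delta)]:
    print(name, shannon_entropy(p))
```

The uniform distribution attains the maximum value $\ln 4$ for four outcomes, and the delta distribution attains the minimum value of zero.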

Using entropy to perform probabilistic inference this way has proven extraor- 
dinarily successful in an extremely large number of applications, ranging from 
signal processing to machine learning to statistics. Recently it has also been 
realized that the mathematics underlying this type of inference can be used to 
do distributed control and/or optimization. In that context the mathematics is 
known as Probability Collectives (PC). Preliminary experiments validate PC's 
power for control of real-world (hardware) systems, especially when the system 
is large (see collectives.stanford.edu and [11, 12, 13, 14, 15, 16, 17, 18]).

As another example of the successes of entropy-based inference, consider the 
problem of predicting the probability distribution p over the joint state of
a huge number of interacting particles. This is the problem addressed by sta- 
tistical physics. As first realized by Jaynes, such prediction is an exercise in 
probabilistic inference of exactly the sort Maxent can be applied to. Accord- 
ingly statistical physics can be addressed — and in fact derived in full — using 
Maxent ^0 ^ . In light of all the tests physicists have done of the predictions of 
statistical physics, this means that there are (at least) tens of thousands of ex- 
perimental confirmations of that principle in domains with a very large number 
of interacting particles. 

We can similarly use Shannon's entropy to do inference of (the distribution
governing) the joint mixed strategy $q(x) = \prod_{i=1}^{N} q_i(x_i)$ in any
game involving $N$ players with pure strategies $\{x_i\}$. Probabilistic
inference applied to game theory this way is known as Predictive Game Theory
(PGT). In PGT, the whole point is to apply probability theory in general and
Bayesian analysis in particular to games and their outcomes. This contrasts with
their use in conventional game theory *within* the structure of individual games
(e.g., in correlated equilibria [20]).

1.1 The relation between PGT and conventional game 
theory: a first look 

Before presenting PGT in detail, this section illustrates some of its connection 
to other work in game theory and statistical physics. 

Statistical physics provides a unifying mathematical formalism for the physics 
of many-particle systems. Any question related to that type of physics can be 
analyzed, in principle at least, simply by casting it in terms of that formalism, 
and then performing the associated calculations. There is no need to introduce 
any new formalism for new questions, new Hamiltonians, etc. 

PGT arises from information theory similarly to how statistical physics arises 
from information theory, only in a different context. Accordingly, PGT can 
play a unifying role for games analogous to the one statistical physics plays
for many-particle systems. All questions related to games can be analyzed, in 
principle at least, simply by casting them in terms of PGT, and then performing 
the associated calculations. There is no need to create new formalisms for new 
game theory issues, new presumptions about the way humans behave in games, 
etc; one simply casts them in terms of PGT. 

Now in PGT the idea of an "equilibrium concept", so central to conventional
game theory, does not directly arise. Let $q(x)$ indicate a joint mixed strategy
over joint move $x$. Generically in PGT, the support of the probability density
function over $q$'s, $P(q)$, has non-zero measure. In this, one does not allow
only a single "equilibrium" $q$, or even a countable set of $q$'s comprising the
"equilibrium" joint strategies; the number of allowed $q$'s is uncountable.

The fact that the support of P has non-zero measure typically ensures a 
built-in "bounded rationality" to PGT. This is because typically there will be 
q that are allowed (i.e., that have non-zero probability) in which one or more 
of the players is not fully rational. This aspect of the measure of P has other 
consequences as well. For example, it means that rather than consider the 
values of economic quantities of interest at a single (or at most countable set of) 
equilibrium q, as is conventionally done, one should consider the expected value 
of such quantities under P(q). This means that attributes of those quantities 
like how nonlinear they are (which is crucial to approximating the integrals 
giving their expectation values) have consequences when PGT is used to analyze 
economics issues, consequences that they do not have when conventional game 
theory is used. 

Are there quantities in PGT that are analogous to equilibrium concepts,
even if $P$'s support has non-zero measure? One possible interpretation of what
an "equilibrium concept" could mean in PGT is the Bayes-optimal $q$. Note
though that the Bayes-optimal $q$ in general depends on the loss function of the
external scientist making a prediction about the physical system (i.e., about the
game). So consider two scenarios, both concerning the exact same game, with
the exact same knowledge concerning the game, and therefore the exact same
distributions over joint mixed strategies. However, let the loss function of the
external scientist (reflecting how their prediction will be used) differ between
the two scenarios. Then the Bayes-optimal prediction will also differ between
the two scenarios. So the very choice of "equilibrium concept" is determined (in
part) by the external scientist analyzing the game; the "equilibrium" joint mixed
strategy is not purely a function of the game itself, but rather also involves the
external scientist making predictions about the system.

This dependence on the external scientist of PGT's (analogue of the con- 
ventional) notion of a game's equilibrium is not a philosophical preference. It 
is not something that we have discretion to adopt or not. Rather it is intrinsic 
to our analyzing games with human players the same way we analyze other physical
systems in the Bayesian paradigm: by deriving distributions over truths based 
on partial information, and then (if needed) making single predictions based on 
that distribution together with an external loss function. Under this interpre- 
tation of equilibrium concept, we have no choice but to accept the dependence 
of point predictions on the external scientist making the prediction. 

Another possible interpretation of the "equilibrium" of a game is as the
posterior

$$P(x \mid \mathcal{I}) = \int dq\, q(x)\, P(q \mid \mathcal{I}). \tag{1}$$

Just like the Bayes-optimal $q$, $P(x \mid \mathcal{I})$ reflects the ignorance
of us, the external scientists, concerning the game and its players, as well as
the intrinsic noise/randomness in how the players choose their moves. Unlike the
Bayes-optimal $q$ though, $P(x \mid \mathcal{I})$ does not depend on the loss
function of the external scientist.

On the other hand, in general $P(x \mid \mathcal{I})$ will not be a product
distribution, i.e., it will not have the moves of the players be independent.
This is true even though $P(q \mid \mathcal{I})$ is restricted to such
distributions (a linear combination of product distributions typically is not a
product distribution). In addition, say that $P(q \mid \mathcal{I})$ is
restricted to Nash equilibria $q$. Typically, if there is more than one such
equilibrium (i.e., the support of $P$ contains more than one point), then under
$P(x \mid \mathcal{I})$ none of the players is playing an optimal response to
the mixed strategy over the other players. In other words, even though we might
know that all the players are in fact perfectly rational, our prediction of their
moves has "cross-talk" among the multiple equilibria and does not have perfect
rationality.

Typically for any $P(q \mid \mathcal{I})$ there is only one (perhaps difficult
to evaluate) Bayes-optimal prediction (e.g., for quadratic loss functions that
prediction is the posterior mean, $\int dq\, q\, P(q \mid \mathcal{I})$).
Similarly $\int dq\, q(x)\, P(q \mid \mathcal{I})$ is always unique. So under
either of these interpretations of "equilibrium concept", every game has a
unique equilibrium. In this, whichever of these PGT-based interpretations
of equilibrium we adopt, all work in conventional game theory that attempts
to "fix" the possible multiplicity of conventional concepts of equilibrium (e.g.,
the many proposed refinements of the Nash equilibrium concept) is rendered
moot. The same fate obtains for the different equilibrium concepts that have
been proposed in cooperative game theory.
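
The "cross-talk" among multiple equilibria described above can be checked numerically. The sketch below assumes a hypothetical two-player coordination game (each player earns 1 if the moves match and 0 otherwise) with two pure Nash equilibria, and a hypothetical $P(q \mid \mathcal{I})$ that puts weight 0.7 on one and 0.3 on the other. It forms the posterior over joint moves as the weighted mixture of the two product distributions, tests whether that mixture is itself a product distribution, and tests whether player 1's marginal is a best response to player 2's marginal.

```python
from itertools import product


def joint_from_product(q1, q2):
    """Joint distribution q(x) = q1(x1) * q2(x2) of two independent players."""
    return {(a, b): q1[a] * q2[b]
            for a, b in product(range(len(q1)), range(len(q2)))}


def utility(a, b):
    """Hypothetical coordination payoff, identical for both players."""
    return 1.0 if a == b else 0.0


# The two pure Nash equilibria, written as product distributions.
nash_a = joint_from_product([1.0, 0.0], [1.0, 0.0])
nash_b = joint_from_product([0.0, 1.0], [0.0, 1.0])
weights = (0.7, 0.3)  # hypothetical P(q | I), supported only on Nash equilibria

# Posterior over joint moves: P(x | I) = sum_q q(x) P(q | I).
posterior_x = {x: weights[0] * nash_a[x] + weights[1] * nash_b[x] for x in nash_a}

marg1 = [sum(p for (a, _), p in posterior_x.items() if a == k) for k in range(2)]
marg2 = [sum(p for (_, b), p in posterior_x.items() if b == k) for k in range(2)]
product_of_marginals = joint_from_product(marg1, marg2)
is_product = all(abs(posterior_x[x] - product_of_marginals[x]) < 1e-12
                 for x in posterior_x)

# Expected utility of each pure move of player 1 against player 2's marginal,
# compared with the expected utility of player 1's own marginal strategy.
eu_moves = [sum(marg2[b] * utility(a, b) for b in range(2)) for a in range(2)]
eu_marginal = sum(marg1[a] * eu_moves[a] for a in range(2))

print("P(x | I):", posterior_x)
print("is a product distribution:", is_product)
print("best pure response value:", max(eu_moves), "marginal value:", eu_marginal)
```

In this example the mixture is not a product distribution, and player 1's marginal earns strictly less than its best pure response, even though every $q$ in the support is a Nash equilibrium.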

As a practical matter, calculating the exact Bayes-optimal $q$ can often be
quite difficult. As a substitute, even if it is not Bayes-optimal, we can
calculate the MAP $q$. When $P(q \mid \mathcal{I})$ is peaked the MAP $q$ should
be a good approximation to the Bayes-optimal $q$. Indeed, it is common in
Bayesian analysis to approximate the Bayes-optimal prediction by expanding the
posterior as a Gaussian centered on the MAP prediction.

This MAP $q$ is the minimizer of a Lagrangian functional $\mathcal{L}(q)$. In
general this MAP $q$ is a bounded rational equilibrium rather than a Nash
equilibrium. As shown below, this MAP bounded rational equilibrium can often be
approximated by simultaneously having each player $i$'s mixed strategy
$q_i(x_i)$ be a Boltzmann distribution over the values of its expected utility
for each of its possible moves:

$$q_i(x_i) \propto e^{\beta_i E_q(u^i \mid x_i)} \quad \forall i \tag{2}$$

where the joint distribution $q(x) = \prod_i q_i(x_i)$ and $u^i(x)$ is player
$i$'s utility function.
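
One simple way to look for a solution of the coupled Boltzmann equations above is damped fixed-point iteration. The sketch below is an illustration under stated assumptions, not the paper's procedure: the two-player payoff matrices (a prisoner's-dilemma-like game), the values $\beta_1 = \beta_2 = 1$, the damping factor, and the iteration count are all hypothetical choices, and such an iteration is not guaranteed to converge for every game.

```python
import math


def softmax(values, beta):
    """Boltzmann distribution proportional to exp(beta * value)."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


def expected_utilities(u_row, u_col, q1, q2):
    """Expected utility of each pure move, given the other player's mixed strategy."""
    eu1 = [sum(q2[b] * u_row[a][b] for b in range(len(q2))) for a in range(len(q1))]
    eu2 = [sum(q1[a] * u_col[a][b] for a in range(len(q1))) for b in range(len(q2))]
    return eu1, eu2


def logit_response_fixed_point(u_row, u_col, beta1, beta2, iters=2000, damping=0.5):
    """Damped iteration toward q_i(x_i) proportional to exp(beta_i E(u^i | x_i))."""
    n1, n2 = len(u_row), len(u_row[0])
    q1 = [1.0 / n1] * n1
    q2 = [1.0 / n2] * n2
    for _ in range(iters):
        eu1, eu2 = expected_utilities(u_row, u_col, q1, q2)
        r1, r2 = softmax(eu1, beta1), softmax(eu2, beta2)
        q1 = [(1 - damping) * q + damping * r for q, r in zip(q1, r1)]
        q2 = [(1 - damping) * q + damping * r for q, r in zip(q2, r2)]
    return q1, q2


# Hypothetical payoffs: move 0 = cooperate, move 1 = defect.
u_row = [[3.0, 0.0], [5.0, 1.0]]  # row player's utility u^1(x1, x2)
u_col = [[3.0, 5.0], [0.0, 1.0]]  # column player's utility u^2(x1, x2)

q1, q2 = logit_response_fixed_point(u_row, u_col, beta1=1.0, beta2=1.0)
print("q1:", q1)
print("q2:", q2)
```

With finite $\beta_i$ both players put most, but not all, of their probability on the dominant move, which is the bounded rational behavior the Boltzmann form encodes.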

In general there may be more than one solution to the set of coupled
equations in Eq. (2). (See [22] for examples of closed-form solutions to this set of
coupled equations.) In conventional game theory, the set of all such solutions 
is sometimes called the (logit response) Quantal Response Equilibrium (QRE) 
[221 EH! 12]- It has been used as a convenient way to encapsulate bounded ratio- 
nality. Typically approximating the MAP mixed strategy with the QRE should 
incur less and less error the more players there are in the game. However as 
discussed below, for small games the QRE may be a poor approximation to the 
MAP (which itself is an approximation to the Bayes-optimal prediction). Below 
the correction terms of the QRE (as an approximator of the MAP distribution) 
are calculated. 

Another relation between the QRE and PGT, one that doesn't involve
approximations, starts with the fact that at Nash equilibrium each player $i$
sets its strategy to maximize its expected utility $E_{q_i, q_{-i}}(u^i)$ for
fixed $q_{-i}$.$^2$ Consider instead having each player $i$ set $q_i$ to
optimize an associated functional, the "maxent Lagrangian":

$$\mathcal{L}_i(q_i) = E_{q_i, q_{-i}}(u^i) - T_i S(q_i, q_{-i}). \tag{3}$$

For all $T_i \to 0$ the equilibrium $q$ that simultaneously minimizes
$\mathcal{L}_i\ \forall i$ is a Nash equilibrium 123 E3 EH1 EH ES EH. For
$T_i > 0$ one gets bounded rationality. Indeed, under the identity
$T_i = \beta_i^{-1}\ \forall i$ the solution to this modified Nash equilibrium
concept turns out to be the QRE.

$^2$ Throughout this paper the minus sign before a symbol specifying a
particular player indicates the set of all of the other players, and similarly
for a minus sign before a set of player symbols.

As discussed in [25], the maxent Lagrangian also arises in statistical physics,
where it is called (a mean field approximation to) the "free energy". This
formal connection between PGT/QRE and statistical physics can be exploited
in several ways. As an example, consider the case where one's prior information
consists of the expected energy of a set of interacting particles with joint
state $r$, a scenario known as the "canonical ensemble" (CE) in statistical
physics. In this situation the MAP estimate of the density function $p(r)$ using
an entropic prior is the minimizer of $\mathcal{L}(p) = E_p(H) - T S(p)$, where
$H$ is the energy of the system of particles. In light of the formula for the
maxent Lagrangian, this suggests that bounded rational players in a game can be
made formally identical to the particles in the CE. Under this identification,
the moves of the players play the roles of the states of the particles, and
particle energies are translated into player utilities. Particles are
distributed according to a Boltzmann distribution over their energies, and mixed
strategies are Boltzmann distributions over expected payoffs.$^3$
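
The claim that the Boltzmann distribution minimizes $E_p(H) - T S(p)$ can be spot-checked numerically for a small finite system. The sketch below assumes three hypothetical energy levels and a hypothetical temperature, and compares the free energy of the Boltzmann distribution against that of many randomly drawn distributions. It is a sanity check under those assumptions, not a proof.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats, with 0 ln 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def free_energy(p, energies, temperature):
    """L(p) = E_p(H) - T S(p) for a finite state space."""
    expected_energy = sum(pi * h for pi, h in zip(p, energies))
    return expected_energy - temperature * entropy(p)


def boltzmann(energies, temperature):
    """p(r) proportional to exp(-H(r) / T)."""
    m = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(h - m) / temperature) for h in energies]
    z = sum(weights)
    return [w / z for w in weights]


energies = [0.0, 1.0, 2.5]  # hypothetical energy levels H(r)
temperature = 0.7           # hypothetical temperature T

p_star = boltzmann(energies, temperature)
f_star = free_energy(p_star, energies, temperature)

rng = random.Random(0)  # fixed seed so the check is reproducible


def random_distribution(n):
    weights = [rng.random() + 1e-12 for _ in range(n)]
    z = sum(weights)
    return [w / z for w in weights]


gaps = [free_energy(random_distribution(3), energies, temperature) - f_star
        for _ in range(1000)]
print("Boltzmann free energy:", f_star)
print("smallest gap over random trials:", min(gaps))
```

Every gap is non-negative because the difference in free energy equals $T$ times the Kullback-Leibler divergence from the trial distribution to the Boltzmann distribution.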

This connection between PGT and statistical physics raises the potential 
of transferring some of the powerful mathematical techniques that have been 
developed in the statistical physics community into game theory. As an example, 
in the "Grand Canonical Ensemble" (GCE) the number of particles of various 
types is variable rather than being pre-fixed. One's prior information is then 
extended to include the expected numbers of particles of those types. This 
corresponds to having a variable number of players of various types in a bounded 
rational game. This suggests how to extend game theory to accommodate games 
with statistically varying numbers of players. Among other applications, this 
provides us with a new framework for analyzing games in evolutionary scenarios, 
different from evolutionary game theory. (A different type of "GCE game" is 
analyzed below.)

There are many other aspects of statistical physics that might carry over
to PGT. For example, even in the CE, often there are regimes where as some
parameter of the system is changed an infinitesimal amount, the character of
the system changes drastically. These are known as "phase transitions". The
connection between the math of PGT and that of statistical physics suggests 
that similar phenomena may arise in games with human players. 

PGT has many other advantages in addition to providing a way to exploit
techniques from statistical physics in the context of noncooperative games. For
example, as illustrated below it provides a natural way to quantify the
rationality of experimentally observed behavior of human subjects. One can then,
for example, empirically observe the dynamic relationship coupling the
rationalities of real players as they play a sequence of games with one another.
(Since such correlations are inherently a property of distributions across mixed
strategies, they are not readily analyzed using conventional
non-distribution-based game theory.)

$^3$ Note that having the probability density over mixed strategies follow a
Boltzmann distribution does not mean that functionals of that density are
Boltzmann-distributed. In particular, the distribution over values of the
utility function need not be Boltzmann-distributed.

Another strength of PGT arises if we change the coordinates of the underlying
space of joint pure strategies $\{x\}$. After such a change, our mathematics
describes a type of bounded rational cooperative game theory in which the moves 
of the players become binding contracts they all offer one another [3U||!^. In this 
sense, PGT provides a novel relation between cooperative and noncooperative 
game theory. 

1.2 Roadmap 

The purpose of this paper, like that of the original work on game theory, is to 
elucidate a framework for analyzing the reasonably imputed consequences about 
the behavior of the players when all one knows is the game structure. If possible, 
this framework should be able to accommodate extra knowledge concerning the 
game and/or the players if it is available. Loosely speaking, the goal is to provide 
for game theory the analog of what the canonical and grand canonical ensembles 
provide for statistical physics: a first-principles mathematical scaffolding into 
which one inserts one's knowledge concerning the system one is analyzing, to 
make predictions concerning that system. (See the future work section below
for further discussion of this point.) 

To do this, the next section starts by cursorily reviewing noncooperative 
game theory, Bayesian analysis and the entropic prior arising in information 
theory. In an appendix that prior is illustrated by showing how it can be used
to derive statistical physics. In the following section foundational issues of PGT 
and associated mathematical tools are presented. 

The next two sections form the core of the paper. The first of them applies
the entropic prior to infer mixed strategies of coupled players in a game
$\gamma$. This application can be viewed as a prescription for how to infer the
mixed strategy $P(q \mid \mathcal{I})$ adopted by a Nature involved in a
meta-game with a scientist, where the moves $q$ of Nature are mixed strategies in
$\gamma$. This section then relates this coupled-players analysis to the QRE.
The section after this considers independent players, leveraging the analysis
for coupled players.

The following section illustrates some of the breadth of PGT. It is shown 
there how bounded rationality arises formally as a cost of computation for the 
independent players scenario. We then present rationality functions. These are 
a model-independent way to quantify the (bounded) rationality of the mixed 
strategies followed by real-world players. This section ends by showing how to 
apply PGT to games with stochastically varying numbers of players. 

An appendix discusses the relation between PGT and previous work, and 
more generally the history of attempts to apply information theory within game 
theory. 

2 Preliminaries



This section first reviews non-cooperative game theory. It then reviews informa- 
tion theory and the associated Bayesian analysis. It ends by illustrating that 
analysis with a review of how it can be used to derive statistical physics. It 
is recommended that those already familiar with these concepts still read the 
middle subsection on Bayesian analysis. 

2.1 Review of noncooperative game theory 

In conventional noncooperative normal form game theory [32, 33, 34, 35, 36]
one has a set of $N$ independent players, indicated by the natural numbers
$\{1, 2, \ldots, N\}$. Each player $i$ has its own finite set of allowed pure
strategies, each such pure strategy written as $x_i \in X_i$. We indicate the
size of that space of possible pure strategies of player $i$ as $|X_i|$. The set
of all possible joint strategies is $X = X_1 \times X_2 \times \cdots \times X_N$
with cardinality $|X| = \prod_{i=1}^{N} |X_i|$, a generic element of $X$ being
written as $x$.

A mixed strategy is a distribution $q_i(x_i)$ over player $i$'s possible pure
strategies, $\{x_i\}$. In other words, it is a vector on the
$|X_i|$-dimensional unit simplex, $\Delta_{X_i}$. Each player $i$ also has a
utility function (sometimes called a "payoff function") $u^i$ that maps the
joint pure strategy of all $N$ of the players into a real number.

As a point of notation, we will use curly braces to indicate an entire set,
e.g., $\{\beta_i\}$ is the set of all values of $\beta_i$ for all $i$. We will
also write $\Delta_X$ to refer to the Cartesian product of the simplices
$\Delta_{X_i}$, so that mixed joint strategies (i.e., product distributions) are
elements of $\Delta_X$. We will sometimes refer to $u^i$ as player $i$'s "payoff
function", and to player $i$'s pure strategy $x_i$ as its "move". $x$ is the
joint move of all $N$ players. As mentioned above, we will use the subscript
$-i$ to indicate all moves / distributions / utility functions, etc., other than
$i$'s. We will use the integral symbol with the measure implicit, so that it can
refer to sums, Lebesgue integrals, etc., as appropriate. In particular, given
mixed strategies of all the other players, we will write the expected utility of
player $i$ as $E(u^i) = \int dx \prod_j q_j(x_j)\, u^i(x)$. As a final point of
notation, we will write $a$ to mean a finite indexed set all of whose components
are either real numbers or infinite (greater than any real number). We will then
write $a > b$ to indicate the generalized inequality that $\forall i$, either
$a_i$ and $b_i$ are real numbers and $a_i > b_i$, both $a_i$ and $b_i$ are
infinite, or $b_i$ is a real number and $a_i$ is infinite. Also, in the
interests of expository succinctness, we will be somewhat sloppy in
differentiating between probability distributions, probability density functions,
etc.; generically, "P(...)" will be one or the other as appropriate.
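
For finite move spaces the integral in the expected utility reduces to a sum over all joint moves, which can be written out directly. The sketch below assumes a hypothetical three-player game in which player 0's utility counts how many of the other players match its move; the mixed strategies are likewise hypothetical.

```python
import math
from itertools import product


def expected_utility(utility, mixed_strategies):
    """E(u^i) = sum_x prod_j q_j(x_j) u^i(x) over all joint moves x."""
    move_spaces = [range(len(q)) for q in mixed_strategies]
    total = 0.0
    for x in product(*move_spaces):
        prob = 1.0
        for q_j, x_j in zip(mixed_strategies, x):
            prob *= q_j[x_j]  # product distribution: moves are independent
        total += prob * utility(x)
    return total


def u0(x):
    """Hypothetical utility of player 0: number of other players matching its move."""
    return sum(1.0 for j in (1, 2) if x[j] == x[0])


# Hypothetical mixed strategies q_0, q_1, q_2, each over two pure moves.
q = [[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]]

num_joint_moves = math.prod(len(q_j) for q_j in q)  # |X| = prod_i |X_i|
value = expected_utility(u0, q)
print("number of joint moves:", num_joint_moves)
print("expected utility of player 0:", value)
```

The cost of this direct sum grows as $|X| = \prod_i |X_i|$, which is one reason exact evaluation becomes expensive as the number of players increases.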

Much of noncooperative game theory is concerned with equilibrium concepts
specifying what joint strategy one should expect to result from a particular
game. In particular, in a Nash equilibrium every player adopts the mixed
strategy that maximizes its expected utility, given the mixed strategies of the
other players. More formally,
$\forall i,\ q_i = \mathrm{argmax}_{q_i'} \int dx\, q_i'(x_i) \prod_{j \neq i} q_j(x_j)\, u^i(x)$
[37].
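
The Nash condition above can be tested for a candidate joint mixed strategy in a two-player game. Because expected utility is linear in a player's own mixed strategy, it suffices to check that no pure-move deviation does better. The sketch below uses matching pennies as a hypothetical example game, and the function name and tolerance are illustrative choices.

```python
def is_nash(u_row, u_col, q1, q2, tol=1e-9):
    """Check that neither player can gain by deviating to any pure move."""
    n1, n2 = len(q1), len(q2)
    # Expected utility of each pure move against the other player's mixed strategy.
    eu1 = [sum(q2[b] * u_row[a][b] for b in range(n2)) for a in range(n1)]
    eu2 = [sum(q1[a] * u_col[a][b] for a in range(n1)) for b in range(n2)]
    # Expected utility each player actually receives under (q1, q2).
    v1 = sum(q1[a] * eu1[a] for a in range(n1))
    v2 = sum(q2[b] * eu2[b] for b in range(n2))
    return max(eu1) <= v1 + tol and max(eu2) <= v2 + tol


# Matching pennies: the row player wins on a match, the column player on a mismatch.
u_row = [[1.0, -1.0], [-1.0, 1.0]]
u_col = [[-1.0, 1.0], [1.0, -1.0]]

uniform_is_nash = is_nash(u_row, u_col, [0.5, 0.5], [0.5, 0.5])
pure_is_nash = is_nash(u_row, u_col, [1.0, 0.0], [1.0, 0.0])
print("uniform mixing is Nash:", uniform_is_nash)
print("both playing move 0 is Nash:", pure_is_nash)
```

Uniform mixing passes the check, while the pure joint move fails it because the column player gains by switching.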

One problem with the Nash equilibrium concept is its assumption of full
rationality. This is the assumption that every player $i$ can both calculate
what the strategies $q_{j \neq i}$ will be and then calculate its associated
optimal distribution.$^4$ This requires in particular that each player calculate
the entire joint distribution $q(x) = \prod_j q_j(x_j)$. If for no other reasons
than computational limitations of real humans, this assumption is essentially
untenable. This problem is just as severe if one allows statistical coupling
among the players [20, 32].

For simplicity, throughout each analysis presented in this paper we will treat 
N, the pure strategy spaces, the associated utility functions, and the statisti- 
cal independence of the pure strategies chosen by the players, as fixed parts 
of the problem definition rather than random variables. Further we will im- 
pose no a priori restrictions about whether the players have encountered one 
another before, what information they have about one another and the game 
they're playing, whether they have engaged in the game before, what their in- 
formation sets are, whether there are any social norms at work on them, etc. 
We do not even require, a priori, that the players be prone to human
psychological idiosyncrasies. To incorporate any information of this sort into the
analysis would mean modifying the priors and/or likelihoods considered below 
in a (mostly) straightforward way, but is beyond the scope of this paper. 

2.2 Review of Bayesian analysis and decision theory 

Consider any scenario in which we must reason about attributes of a physical 
system without knowing in full all salient aspects of that system. This is the 
basic problem of inductive inference. How should we do this reasoning? Many 
different desiderata, arising from work by De Finetti, Cox, Zellner and many 
others, lead to the same conclusion: if our goal is to assign real-valued numbers
to the different hypotheses concerning the system at hand, we should use the
rules of probability theory. In particular, this
implies that we should use Bayes' theorem to calculate what we want to know
from what we are told/assume/observe/know:

$$P(\text{truth } z \mid \text{data } d) \propto P(d \mid z)\, P(z) \qquad (4)$$

where the proportionality constant is set by the requirement that
$P(\text{truth } z \mid \text{data } d)$ be normalized, and "data" means everything we are
told/assume/observe/know concerning the system. $P(\text{truth } z \mid \text{data } d)$ is called
the posterior probability, $P(\text{data } d \mid \text{truth } z)$ is called the likelihood, and $P(z)$
is called the prior.
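For a finite set of hypotheses, Eq. (4) amounts to an elementwise product followed by normalization. The sketch below uses made-up prior and likelihood values; the function name `posterior` is likewise an illustrative assumption.

```python
import numpy as np

def posterior(prior, likelihood):
    """P(z | d) proportional to P(d | z) P(z), normalized over hypotheses z."""
    unnormalized = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return unnormalized / unnormalized.sum()

prior = np.array([0.5, 0.3, 0.2])         # hypothetical P(z)
likelihood = np.array([0.1, 0.4, 0.8])    # hypothetical P(d | z) for the observed d
post = posterior(prior, likelihood)
```

The normalization constant is simply the sum of the unnormalized products, which plays the role of the proportionality constant mentioned in the text.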

Say that rather than a full posterior distribution, for some reason we must
predict a single one of the candidate hypotheses z. According to Savage's
axioms, to do this we must be provided with a loss function $L(y, z)$ that maps
any pair of a truth z and a prediction y to a real-valued loss (see
various chapters in [33]). Then the associated Bayes optimal prediction is
$\operatorname{argmin}_y E_P(L(y, z))$, where the expectation is over the posterior distribution
$P(\text{truth } z \mid \text{data } d)$. 5

4 Here we use the term "bounded rationality" in the broad sense, to indicate any mixed
strategy that does not maximize expected utility, regardless of how it arises.

Note that the loss function is determined by the scientist external to the 
system who is making the prediction; it is not specified in the definition of the 
system under consideration. 
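The dependence of the Bayes-optimal prediction on the loss function can be seen in a small finite example. The posterior values and the two loss matrices below are assumptions chosen for illustration, as is the helper name `bayes_optimal`.

```python
import numpy as np

def bayes_optimal(posterior, loss):
    """Index y minimizing sum_z posterior[z] * loss[y, z]."""
    expected_loss = loss @ posterior
    return int(np.argmin(expected_loss))

z_values = np.array([0.0, 1.0, 2.0, 3.0])
posterior = np.array([0.45, 0.05, 0.20, 0.30])    # hypothetical P(z | d)

zero_one = 1.0 - np.eye(len(z_values))                    # L(y, z) = 1 if y != z
squared = (z_values[:, None] - z_values[None, :]) ** 2    # L(y, z) = (y - z)^2

y_zero_one = bayes_optimal(posterior, zero_one)
y_squared = bayes_optimal(posterior, squared)
```

With the same posterior, the zero-one loss selects the posterior mode while the squared loss selects the candidate closest to the posterior mean, so the two analysts would report different single predictions.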

According to the foregoing, to do statistical inference for a particular physi- 
cal scenario our first task is to translate the "particular physical scenario" we're 
considering into a mathematical formulation of possible truths z, data d, etc. 
Having done that, we can employ mathematical tools like Bayes' theorem, ap- 
proximation techniques for finding Bayes-optimal predictions, etc. to analyze 
our mathematical formulation. After doing this we use our translation to con- 
vert all this back into the physical scenario. This translational machinery is 
how we couple the abstract mathematical structure of probability theory to our 
particular physical inference problem. 

To assist us in making this translation, we imagine an infinite set of instances 
of our physical scenario. All of those instances share the physical characteristics 
of our scenario that fix our statistical inference problem. Every other physical 
characteristic is allowed to vary across those instances. In this way the set of all
of those instances defines our statistical inference problem [7]. Formally, we define
the invariant of our inference problem as the set of exactly those characteristics 
of the physical scenario, and no others, that would necessarily be the same if 
we were presented with a novel instance of the exact same inference problem. 
Equivalently, we can define the invariant as the set of all the physical instances 
consistent with those characteristics, and no instance that is inconsistent with 
those characteristics. By explicitly delineating an inference problem's invariant, 
we can mathematically formalize that problem. 

As discussed in the introduction, this Bayesian perspective is inherent in
much of conventional game theory. Most obviously, all the work on Bayesian
games, correlated equilibria, etc. adopts elements of the Bayesian perspective.
In addition, we can define a meta-game $\Gamma$ of a Scientist S playing against
Nature N in which the possible states of N are the possible mixed strategies q of
the game $\gamma$ we wish to analyze. The move of S in $\Gamma$ is interpreted as a
prediction of N's move, i.e., of the mixed strategy of $\gamma$. The utility function of S
is interpreted as the negative of the loss function of S. Accordingly, for S to
do exactly what is prescribed by conventional game theory (choose a strategy
that maximizes her expected utility) she must make the Bayes-optimal
prediction of the outcome of the game. To do this she must infer N's mixed
strategy, $P(q)$. However she has only limited information concerning N, e.g.,
the specification of $\gamma$. Accordingly, the crucial issue in game theory is how,
based on her limited information concerning $\gamma$, the external scientist should
infer N's mixed strategy, $P(q)$. This is a problem of how to infer a distribution
over distributions.

5 There is controversy about the precise details of Savage's axioms and their
implications, the precise way priors should be chosen, and even the precise physical meaning of
"probability" [7]. Such details are not important for current purposes. Other choices can be
made, based on other desiderata. However typically the broad outlines of any approach based
on such alternatives are the same: to do inference one constructs a probability distribution over
possible truths and then, if needed, distills that distribution into a single prediction.

This inference, done using invariants, is the topic of PGT. In this paper 
we will only explore such use of invariants in conjunction with the entropic 
prior, since that prior is directly concerned with inferring distributions over 
distributions. However other priors also merit investigation. 

2.3 Review of the entropic prior 

Shannon was the first person to realize that based on any of several separate
sets of very simple desiderata, there is a unique real-valued quantification of
the amount of syntactic information in a distribution $P(y)$. He showed that
this amount of information is (the negative of) the Shannon entropy of that
distribution, $S(P) = -\int dy\, P(y) \ln\left[\frac{P(y)}{\mu(y)}\right]$. 6 Note that for a product
distribution $P(y) = \prod_i P_i(y_i)$, entropy is additive: $S(P) = \sum_i S(P_i)$.

So for example, the distribution with minimal information is the one that 
doesn't distinguish at all between the various y, i.e., the uniform distribution. 
Conversely, the most informative distribution is the one that specifies a single 
possible y. 
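The additivity of entropy for product distributions, and the two extreme cases just described, can be checked directly for finite y. This sketch assumes a uniform measure $\mu$, so that it drops out of the formula; the helper name `entropy` and the example distributions are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum_y p(y) ln p(y) with a uniform measure; 0 ln 0 := 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.5])
joint = np.outer(p1, p2)                 # product distribution P(y1, y2) = p1(y1) p2(y2)

uniform3 = np.full(3, 1.0 / 3.0)         # least informative distribution over 3 values
point_mass = np.array([1.0, 0.0, 0.0])   # most informative: a single possible y
```

The uniform distribution attains the maximal entropy ln 3 on three values, and the point mass attains the minimal entropy 0.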

Say that the possible values of the underlying variable y in some particular
probabilistic inference problem have no known a priori stochastic relationship
with one another. For example, y may not be numeric, but rather consist of
the three symbolic values, {red, dog, Republican}. Then simple desiderata-based
counting arguments can be used to conclude the prior probability of any
distribution $p(y)$ is proportional to the entropic prior, $\exp(\alpha S(p))$, for some
associated non-negative constant $\alpha$. 7

Intuitively, absent any other information concerning a particular distribution 
p, the larger its entropy the more a priori likely it is. 

If the possible y have a more overt mathematical relationship with one
another, the situation is often not so clear-cut. For example, symmetry group
arguments are often invoked in such situations, and can give more refined
predictions. Despite this, for the most important scenarios it considers, scenarios
where it has had such great successes, statistical physics simply uses the
entropic prior, as described below. In accord with this, in this paper attention
will be restricted to the entropic prior.

6 $\mu$ is an a priori measure over y, often interpreted as a prior probability distribution.
Unless explicitly stated otherwise, here we will always assume it is uniform, and not write it
explicitly.

7 The issue of how to choose $\alpha$, or better yet how to integrate over it, is quite subtle,
with a long history. See in particular work on ML-II [39] and the "evidence procedure" [10].

8 Note that this is different from saying that the larger s is, the more a priori likely it is
that the entropy of p is larger:

$$P_S(s) = \int dp\, \delta(S(p) - s)\, P(p) = \frac{\int dp\, \delta(S(p) - s) \exp(\alpha S(p))}{\int dp\, \exp(\alpha S(p))}.$$

Say we have some information $\mathscr{I}$ concerning p. Then by Bayes' theorem,
the posterior probability of distribution p is

$$P(p \mid \mathscr{I}) \propto \exp(\alpha S(p))\, P(\mathscr{I} \mid p). \qquad (5)$$

The associated MAP prediction of p based on $\mathscr{I}$ is $\operatorname{argmax}_p P(p \mid \mathscr{I})$.

Intuitively, Eq. (5) pushes us to be conservative in our inference. Of all hypotheses
p equally consistent (probabilistically) with our provided information,
we are led to prefer those that contain minimal extra information beyond that 
which is contained in the provided information. This is a formalization of Oc- 
cam's razor. 

Physically, $\mathscr{I}$ is all characteristics of the system that would not change if
we were presented with a novel instance of the exact same inference problem.
From a frequentist perspective, it is an invariant across a set of experiments:
$\mathscr{I}$ delineates what characteristics of the system are fixed in those experiments,
while all characteristics not in $\mathscr{I}$ are allowed to vary. In essence, $\mathscr{I}$ is the
invariant that defines the inference problem.

In particular, $\mathscr{I}$ includes any functions of p, $F(p)$, such that we know (by
the specification of the precise inference problem at hand) that $F(p)$ would not
change if we confronted a novel instance of the same inference problem. In
general we may not know the actual value of $F(p)$ that is shared among the
instances specified by $\mathscr{I}$; we may only know that that value is the same in all
of those instances. 9

Note that $\mathscr{I}$ cannot specify p, the precise state of the system: there must
be some salient characteristics of the system that are not fixed by $\mathscr{I}$. If this were
not the case the likelihood $P(\mathscr{I} \mid p)$ would be a delta function, and therefore the
prior would be irrelevant. In such a case, statistical inference would reduce to
the truism "whatever happens happens". Accordingly, we never have $\mathscr{I}$ contain
a set of functions $\{F_i(p)\}$ whose values jointly fix p exactly.

An important example of the foregoing occurs in statistical physics, where $\mathscr{I}$
is the observed temperature T of a physical system. T is taken to fix a function
$F(p)$, namely the expected energy $H(x)$ of the system under distribution $p(x)$:
$F(p) = \int dx\, H(x)\, p(x)$ is fixed by $\mathscr{I}$. It is the application of the entropic prior
to this situation that results in the canonical ensemble mentioned above. The
number of experiments validating this application is extraordinary; including
experiments in high school labs, it is probably on the order of $10^8$ (at least). In
this paper that application serves as a touchstone for how to translate $\mathscr{I}$ into a
distribution over distributions, and therefore as the primary analogy mentioned
in the derivation of PGT. However in the interests of expediting the flow of this
paper, that application is relegated to an appendix.

9 Relating this back to the mathematics of probability theory, in such a case that value of 
F(p) is known as a hyperparameter. Formally, hyperparameters have their own priors. To 
get a final posterior over what we wish to infer — p — we must marginalize over possible 
values of all hyperparameters. Implicitly, the reason that here we simply choose one value of 
a hyperparameter and discard all others is that we expect the posterior distribution of the 
hyperparameter to be highly peaked, so that we do not need to carry out such marginalization. 
See the discussion of ML-II in [39] and [10].

3 Predictive Game Theory - general considerations



3.1 The two types of game theory 

Say we are presented with a noncooperative normal form game for N players 
other than us, and a set of N subjects who will fill the roles of the players in 
a fixed manner. We wish to make predictions concerning the outcome of that 
game when played by that set of players other than us. As discussed above, for 
us to obey Cox's axioms, when making those predictions we must use probability 
theory. If we wish to distill the probability distribution over outcomes into a 
single prediction, then to obey Savage's axioms we must have a loss function 
and use decision theory. 

Note that these normative axioms differ fundamentally from those that can 
be used to derive various equilibrium concepts. The axioms underlying equilib- 
rium concepts concern external physical reality, namely the players of the game. 
They concern something outside of our control. In other words, such axioms 
are hypothesized physical laws. Like any other such laws, they can be
contradicted or affirmed by physical experiment, at least in theory. In fact, behavioral
game theory essentially does just that, experimentally determining which
such hypothesized physical laws are valid. (See also all the work on behavioral
economics.)

In contrast, Cox's and Savage's axioms are normative. They tell us how best 
to make predictions about the physical world. They make no falsifiable claims 
about the real world; it is under our control whether they will be followed, not 
the external world's control. Violating them in our analysis is akin to performing 
an analysis in which we violate the axioms defining the integers. 

Before showing some ways to arrive at a distribution over outcomes for this 
situation, we must clarify what the space of "outcomes" of the game is. There 
are two broad types of such spaces to consider, with associated types of game 
theory. 

In type I game theory, what a player chooses in any particular instance of 
the game is its move in that instance. (This is the analog of the variable y in
the appendix.) In general, in the real world a particular human's choice of move will
vary depending on their mood, how distracted they are, etc. Physically, this 
variability arises from variability in the dynamics of neurotransmitter levels in 
the synapses in their brain during their decision-making, associated dynamical 
variability in the firing potentials of their neurons during that process, etc. 

Due to this variability the choice of each player is governed by a probability
distribution. (This is the analog of the variable q in the appendix.) So the joint choice
of the players is also described by a distribution. We write that joint distribution
as

$$
\begin{aligned}
q(x) &= P(x \mid q) \\
     &= P(x_N \mid q) \prod_{i=1}^{N-1} P(x_i \mid q, x_{i+1}, x_{i+2}, \ldots) \\
     &= P(x_N \mid q) \prod_{i=1}^{N-1} P(x_i \mid q) \\
     &= \prod_{i=1}^{N} P(x_i \mid q) \\
     &= \prod_{i=1}^{N} q_i(x_i), \qquad (6)
\end{aligned}
$$

where the third equality follows from the statistical independence of the players'
choices. So the probability of a particular joint move x is given by a product
distribution $q(x) = \prod_i q_i(x_i)$. 10 If the interaction between the humans is not,
physically, a conventional normal form noncooperative game, then the space of
allowed q must be expanded to allow q that statistically couple the moves. Such
extensions are beyond the scope of this paper.
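The independence step in Eq. (6) can be illustrated for a finite three-player game: when q is a product distribution, conditioning one player's move on the others' moves changes nothing. The strategy values below are hypothetical.

```python
import numpy as np

q1 = np.array([0.6, 0.4])
q2 = np.array([0.1, 0.3, 0.6])
q3 = np.array([0.5, 0.5])

# q(x) = q1(x1) q2(x2) q3(x3), stored as a 3-dimensional array.
q_joint = np.einsum("a,b,c->abc", q1, q2, q3)

# P(x1 | x2, x3): normalize over the first axis for every (x2, x3).
cond_x1 = q_joint / q_joint.sum(axis=0, keepdims=True)

# Marginal of x1: sum out the other two players' moves.
marg_x1 = q_joint.sum(axis=(1, 2))
```

Every column `cond_x1[:, b, c]` coincides with `q1`, which is the content of the third equality in Eq. (6).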

q incorporates the subconscious biases of the players, the day-to-day distri- 
bution of their moods, and more generally the full physical stochastic nature 
of their separate decision-making algorithms. It also reflects what the players 
know about each other, whether they have directly interacted before, what they 
know about the game structure, their information sets, etc. In short, q is the 
physical nature of the game setup, in toto. Note that we cannot examine the 
precise states of all the neurons and neurotransmitters in the brains of the play- 
ers, and even if we could we cannot precisely evaluate the associated stochastic 
dynamics. Accordingly, while it is physically real, in practice it is impossible 
for us to ascertain a distribution qi exactly. 

In contrast, x is the quantity that the players consciously determine, and it is
observable. q is instead the physical process that specifies how the players select
that observable. Both of these quantities differ from our (limited) knowledge
about q. That knowledge is embodied in probability distributions $P(q)$. $P$
reflects us as much as the players.

We imagine an infinite set of instances of our setup, i.e., an infinite set of
q's. The invariant specifies all characteristics of the system (and only those
aspects) such that if they had been different in some particular instance, we
would know it. In going from one instance to the next, we assume the problem
is "reset", i.e., there is no information conveyed from one instance to the next.
(In particular the players' minds are "wiped clean" between instances.) Our
inference problem is to predict q based on such an invariant, i.e., formulate the
posterior $P(q \mid \mathscr{I})$. So an "entropic prior" concerns the probability of any such
joint mixed strategy q, our likelihood must concern q, etc.

10 Loosely speaking, when used as an approximation in statistical physics, such product
distributions are called "mean field theory". See [46].

In type II game theory, what player i chooses in any particular instance of
a game is a mixed strategy $q_i(x_i)$. Each player i's mixed strategy is separately
randomly sampled, "by Nature", to get the player's move $x_i$. So as in type I
games, $q(x) = \prod_i q_i(x_i)$.

In general, in real world type II games, a particular human's choice of mixed
strategy will vary depending on their mood, how distracted they are, etc.
Accordingly the joint choice of the players is described by a distribution $\pi(q)$. $\pi$
reflects the stochastic nature of the players, what they know about each other,
what they know about the game structure, etc. In short, $\pi$ is the physical
nature of the setup. In contrast, our (limited) knowledge about $\pi$ is embodied in
a distribution $P(\pi)$.

As usual, we imagine an infinite set of instances of our setup, i.e., an infinite
set of $\pi$'s, with no information conveyed between instances. The invariant $\mathscr{I}$
specifies all characteristics of the system (and only those aspects) such that
we would know if they had been different in some particular instance. Our
inference problem is to predict $\pi$, i.e., formulate the posterior $P(\pi \mid \mathscr{I})$. So an
"entropic prior" concerns the probability of any such joint distribution $\pi$, our
likelihood must concern $\pi$, etc. Note that the invariant setting the likelihood,
$P(\mathscr{I} \mid \pi)$, is a characteristic of an entire distribution over joint mixed strategies
(namely $\pi$), not (directly) of the joint mixed strategies themselves.

In both game types, since q is a product distribution, if one is given q then
knowing one player's move provides no extra information about another player's
move. (Formally, $P(x_i \mid q, x_j) = P(x_i \mid q)$.) However in the absence of knowing
q in full, knowing just the $q_i$ of some player i may provide information about
the $q_j$ of other players j. For example, this could be the case if those
mixed strategies $q_i$ and $q_j$ are determined in part based on previous interactions
between the players. Similarly, even if the players have never previously
interacted, if there is overlap in what they each know about the game (e.g., they
each know the utility functions of all players), that might couple members of
the set $\{q_i\}$. Accordingly, in type I game theory $P(q \mid \mathscr{I})$ need not be a product
distribution (over the $q_i$) in general, and in type II game theory $\pi$ need not be
a product distribution.

In general, which game type one uses to cast the problem is set by the 
problem at hand. If the players all consciously choose mixed strategies — if 
that's how their thought processes work — then we have a type II game. If the 
players choose moves, we have a type I game. One can even have mixed game 
types, in which some players choose moves, and some choose mixed strategies. 
Our lack of knowledge of which scenario we face is analogous to a lack of knowledge
concerning the payoff structure: our inference problem is not fully specified. 11

11 Of course, the lack of knowledge underlying both game types can in principle be addressed
by setting a prior probability distribution over the underlying unknown and defining an
associated likelihood function. Here that would mean distributions over whether each player
chooses moves or chooses mixed strategies. No such analysis, which would essentially mix the
two game types, is considered in this paper.

For the reasons elaborated above, we will adopt the entropic prior for both
game types. Note that for either game type, the entropic prior evaluated for a
product distribution is itself a product, i.e., if $q(x) = \prod_i q_i(x_i)$, then
$e^{\alpha S(q)} = \prod_i e^{\alpha S(q_i)}$. As a result, by symmetry the associated marginal over x,

$$P(x) = \int dq\, P(x \mid q)\, P(q) = \int dq\, q(x)\, P(q), \qquad (7)$$

must be uniform over x.

In some situations $P(q \mid \mathscr{I})$ will not be of interest, but rather the associated
posterior

$$P(x \mid \mathscr{I}) = \int dq\, P(x \mid q)\, P(q \mid \mathscr{I}) = \int dq\, q(x)\, P(q \mid \mathscr{I}) \qquad (8)$$

will be. Now for the entropic prior, we know what the associated prior $P(x)$
is (it's uniform). This suggests one formulate a likelihood $P(\mathscr{I} \mid x)$. One
could then use Bayes' theorem with the uniform $P(x)$ to arrive at the posterior
$P(x \mid \mathscr{I})$ directly, rather than arrive at it via the intermediate variable q. This
would constitute a third type of game, in addition to the other two presented
above, in which instances would be x's rather than q's or $\pi$'s.

Unfortunately, it is hard to see how to formulate the likelihood $P(\mathscr{I} \mid x)$
without employing q or $\pi$. Recall that an invariant $\mathscr{I}$ is the set of all physical
instances that can occur in our inference problem, and no other instances.
However in formulating $P(\mathscr{I} \mid x)$ instances are specified by values of x, and for
almost any inference problem, all x's may occur. So in any such inference
problem, the associated $P(\mathscr{I} \mid x)$ does not exclude any x at all, i.e., it is vacuous, as
far as inference of x is concerned. It is also hard to see what might be gained by
using such an alternative game type. Indeed, since it conflates the distribution
q concerning physical reality with the distribution P concerning our (lack of)
knowledge about that reality, one would expect substantial losses of insight if
one used such an alternative game type for one's analysis.

As a final notational comment, we will use the following shorthand for each
i's "effective utility", sometimes called i's environment:

$$U^i(x_i) \triangleq E(u^i \mid x_i). \qquad (9)$$

In type I game theory this reduces to

$$U^i(x_i) = E_{q_{-i}}(u^i \mid x_i) = \int dx_{-i}\, q_{-i}(x_{-i} \mid x_i)\, u^i(x_i, x_{-i}), \qquad (10)$$

so that

$$E_q(u^i) = q_i \cdot U^i \qquad (11)$$

when working with type I games. (The expansion of $U^i$ for type II games
proceeds analogously.)
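For a finite type I game the effective utility of Eq. (9) is obtained by summing the other players' product distribution out of $u^i$, after which Eq. (11) is an ordinary dot product. The sketch below uses a randomly generated three-player utility array and a hypothetical helper `effective_utility`; neither comes from the paper.

```python
import numpy as np

def effective_utility(u_i, q_others):
    """U^i(x_i) = sum over x_-i of prod_{j != i} q_j(x_j) u^i(x_i, x_-i).

    u_i has player i's own move on axis 0 and the other players' moves on the
    remaining axes, in the same order as q_others.
    """
    result = np.asarray(u_i, dtype=float)
    for q in q_others:
        # Axis 1 is always the next opponent, since earlier ones are summed out.
        result = np.tensordot(result, q, axes=(1, 0))
    return result

rng = np.random.default_rng(0)
u_1 = rng.normal(size=(2, 3, 4))         # hypothetical utility of player 1
q_1 = np.array([0.3, 0.7])
q_2 = np.array([0.2, 0.5, 0.3])
q_3 = np.array([0.25, 0.25, 0.25, 0.25])

U_1 = effective_utility(u_1, [q_2, q_3])
brute_force = float(np.einsum("abc,a,b,c->", u_1, q_1, q_2, q_3))   # E_q(u^1)
via_environment = float(q_1 @ U_1)                                   # q_1 . U^1
```

The two ways of computing the expected utility agree, which is why the environment $U^i$ is a convenient summary of everything player i needs to know about the others.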

3.2 Needed Mathematical Tools 

This section presents some preliminary mathematical tools from statistical physics 
that are useful for performing Bayesian analysis of games using entropic priors. 
In essence, these tools amount to a suite of relationships involving the Boltz- 
mann distribution, entropy, and optimization. Though it is a bit laborious to 
work through these tools, they are crucial for understanding bounded rational 
players in general, and for understanding the QRE in particular. We will focus 
on type I games; as usual, similar considerations apply for type II games. 

(1) Start by noting that if we take its logarithm, any distribution $q_i(x_i)$ can
be expressed as an exponential of some function over $x_i$. So in particular we
can write any MAP $q_i$ that way:

$$\operatorname{argmax}_{q_i} P(q_i \mid \mathscr{I}) \propto e^{\beta_i f_i(x_i)} \qquad (12)$$

for some appropriate function $f_i$ and constant $\beta_i \ge 0$. 12

(2) We now relate the formulation of an MAP $q_i$ in exponential form (as in
(1)) to a particular choice for our game theory problem's invariant.

Say we associate with each player i a "guess" she makes (potentially
explicitly, potentially not) for her environment function, $U^i$. Write that guessed
function as $f_i(x_i)$. We presume that we can view the player's behavior as though
she were trying to perform well for that (guessed) environment. Formally, we
presume that in each instance of our inference problem, the mixed strategy of
player i results in the same (invariant) value $K_i$ for what $E(U^i)$ would be if
player i's guess for her environment $U^i$ were correct, i.e., if $U^i$ equalled $f_i$. So
the invariant of the game for player i is

$$q_i \cdot f_i = K_i. \qquad (13)$$

Intuitively, with this invariant, as one goes from one instance of the inference
problem to the next, we presume that player i is always just as smart, as
measured with the (potentially counterfactual) environment function $f_i$.

For this invariant, the likelihood $P(\mathscr{I} \mid q_i)$ restricts $q_i$ to lie on the
hyperplane of distributions obeying Eq. (13):

$$P(\mathscr{I} \mid q_i) = \delta(q_i \cdot f_i - K_i). \qquad (14)$$

So given our use of the entropic prior, the posterior for player i is

$$P(q_i \mid \mathscr{I}) \propto e^{\alpha S(q_i)}\, \delta(q_i \cdot f_i - K_i). \qquad (15)$$

12 Of course, there is always freedom to absorb some portion of any $\beta_i$ into the associated
$f_i$, but that is irrelevant for current purposes.

Accordingly the MAP $q_i$ is given by the $q_i$ maximizing the so-called maxent
Lagrangian,

$$\mathscr{L}(q_i) = S(q_i) + \beta_i \left[ q_i \cdot f_i - K_i \right] \qquad (16)$$

where in the usual way the $\beta_i$ are the Lagrange parameters, here divided by
$\alpha$. 13

Solving for the $q_i$ maximizing this Lagrangian, $q_i^{\beta_i}$, we get the distribution of
Eq. (12) with $\beta_i$ a function of $K_i$. Equivalently, we can take $K_i$ to be a function
of $\beta_i$. This is what we will do below.

Now parallel the conventional nomenclature of statistical physics, and
define the partition function

$$Z_{f_i}(\beta_i) = \int dx_i\, e^{\beta_i f_i(x_i)}. \qquad (17)$$

(Note that the partition function is the normalization constant of Eq. (12).) Then
using Eq. (12) to express the Boltzmann distribution $q_i^{\beta_i}$, our constraint Eq. (13)
means that

$$K_i(\beta_i) = f_i \cdot q_i^{\beta_i} = \frac{\partial \ln Z_{f_i}(\beta_i)}{\partial \beta_i} \qquad (18)$$

as is readily verified by evaluating the derivative. As shorthand, sometimes we
will absorb $\beta_i$ into $f_i$, and simply write $Z(V) = Z_V(1)$.
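The identity in Eq. (18) can be checked numerically for a finite move space by comparing the expected value of $f_i$ under the Boltzmann distribution with a central finite difference of $\ln Z$. The function names and the example values of `f` and `beta` are assumptions made for illustration, and the stabilizing shift in `log_partition` assumes a non-negative exponent.

```python
import numpy as np

def boltzmann(f, beta):
    """q(x) = exp(beta * f(x)) / Z for a finite move space."""
    w = np.exp(beta * (f - f.max()))    # shift by max(f) for numerical stability
    return w / w.sum()

def log_partition(f, beta):
    """ln Z_f(beta) = ln sum_x exp(beta * f(x)), assuming beta >= 0."""
    m = beta * f.max()
    return float(m + np.log(np.exp(beta * f - m).sum()))

def boltzmann_utility(f, beta):
    """K(beta) = q^beta . f, the expected value of f under q^beta."""
    return float(boltzmann(f, beta) @ f)

f = np.array([1.0, 0.0, -2.0, 0.5])     # hypothetical guessed environment f_i
beta, h = 1.3, 1e-5
numeric_derivative = (log_partition(f, beta + h) - log_partition(f, beta - h)) / (2 * h)
```

The finite-difference derivative of the log partition function matches `boltzmann_utility(f, beta)` to within the discretization error.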

So say we are given some distribution $q_i$. Take its logarithm to get a function
$f_i$ and exponent $\beta_i$. Then use Eq. (18) to translate that $f_i$ and $\beta_i$ into a value
of $K_i$. Using these choices of $K_i$, $f_i$ and $\beta_i$, formulate the associated invariant
$\mathscr{I}$ given by Eq. (13). As shown above, the MAP distribution for that $\mathscr{I}$ is
our starting distribution $q_i$. In this way we can view that $\mathscr{I}$ as the "effective"
invariant for this (arbitrary) starting $q_i$. We can translate any $q_i$ into an MAP
distribution by choosing an appropriate $\mathscr{I}$ this way.
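The construction of an "effective" invariant can be traced through on a three-move example. The sketch assumes a full-support starting distribution, takes $\beta_i = 1$ (one of the many equivalent ways of splitting the logarithm between $f_i$ and $\beta_i$), and checks the MAP property only along the single feasible direction available in three dimensions; all names and numbers are illustrative.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

q = np.array([0.5, 0.3, 0.2])     # arbitrary starting mixed strategy (full support)
f = np.log(q)                     # f_i = ln q_i, with exponent beta_i = 1
beta = 1.0
K = float(q @ f)                  # invariant value K_i = q_i . f_i

# The Boltzmann distribution for (f, beta) reproduces the starting q.
w = np.exp(beta * f)
q_boltzmann = w / w.sum()

# Feasible perturbations keep both normalization and q . f = K fixed, so they
# are orthogonal to the all-ones vector and to f.
d = np.cross(np.ones(3), f)
d = d / np.abs(d).max()
perturbed = [q + eps * d for eps in (-0.05, 0.05)]
```

Both perturbed distributions satisfy the same invariant but have strictly lower entropy than `q`, consistent with `q` being the MAP distribution for its own effective invariant.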

(3) We will refer to the function $K_i(\cdot)$ arising in Eq. (18), which maps $\beta_i$ to the
expected value of $f_i$ under $q_i^{\beta_i}(\cdot)$, as the Boltzmann utility for player i, where
$f_i$ is implicit. We now present some general characteristics of the Boltzmann
utility, characteristics that are particularly important for understanding the
QRE.

First, with slight abuse of terminology, we will sometimes write the
Boltzmann utility with $f_i$ explicitly listed as the first argument and the subscript i
dropped, i.e., as $K(f, \beta \in \mathbb{R}^+)$. In this case it is the domain of the first
argument of the Boltzmann utility $K(f, \beta)$ that (implicitly) sets the space $X_i$ to be
integrated over to evaluate $K(f, \beta)$. 14

13 As an aside, say that we replaced Eq. (13) with the inequality constraint $q_i \cdot f_i \ge K_i$.
The entropy function is concave, and so is this inequality constraint. Accordingly, by Slater's
theorem, there is zero duality gap [47] and we can apply the KKT conditions to get a solution.
In other words, for this modified invariant the maxent Lagrangian still applies, and therefore
so does the solution of Eq. (12).

As is readily verified, the variance (over $f_i$ values) of the Boltzmann
distribution of Eq. (12) is given by the derivative of $K_i(\beta_i)$ with respect to $\beta_i$. Since
variances are non-negative, this means that $K_i(\beta_i)$ is a non-decreasing
function. In fact, for fixed $f_i$, so long as $f_i$ is not a constant-valued function (i.e.,
not independent of its argument), the associated Boltzmann utility $K_i(\cdot)$ is a
monotonically increasing bijection with domain $\beta_i \in [0, \infty)$ and associated range
$\left[ \frac{\int dx_i\, f_i(x_i)}{\int dx_i\, 1},\ \max_{x_i} f_i(x_i) \right)$. 15

We extend the domain of definition of $K_i$ by adding to it the special value
"$\infty^*$", and defining $K_i(\infty^*) = \max_{x_i} f_i(x_i)$. This makes $K_i$ a bijection whose
domain is $\beta_i \in [0, \infty) \cup \{\infty^*\}$ when $f_i(x_i)$ is not a constant, and is the singleton
$\{\infty^*\}$ otherwise. In both cases the range is $\left[ \frac{\int dx_i\, f_i(x_i)}{\int dx_i\, 1},\ \max_{x_i} f_i(x_i) \right]$.

With some abuse of notation from now on we will extend the meaning of the
linear ordering "$\ge$" to have $\infty^* \ge k\ \forall k \in \mathbb{R}$. We will also drop the asterisk
superscript from "$\infty^*$", relying on the context to make the meaning of "$\infty$"
clear. We will engage in more abuse by writing "$b$" even if some component
$b_i = \infty$ (so that b is not a Euclidean vector, properly speaking).

Just as expected $f_i$ cannot decrease as the Boltzmann exponent rises, the
entropy of a Boltzmann distribution $e^{\beta_i f_i(x_i)} / Z_{f_i}(\beta_i)$ cannot increase as its
Boltzmann exponent $\beta_i$ rises. 16 So the picture that emerges is that as $\beta_i$ increases,
the Boltzmann distribution gets more peaked, with lower entropy. At the same
time, it also gets a higher associated expected value of $f_i$.
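The monotonicity properties described in this item can be observed numerically on a finite move space. The sketch below uses a hypothetical non-constant `f` and a grid of exponents; the names `boltzmann` and `entropy` are illustrative helpers rather than anything defined in the paper.

```python
import numpy as np

def boltzmann(f, beta):
    w = np.exp(beta * (f - f.max()))    # shift by max(f) for numerical stability
    return w / w.sum()

def entropy(p):
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

f = np.array([1.0, 0.0, -2.0, 0.5])             # hypothetical non-constant f_i
betas = np.linspace(0.0, 10.0, 101)
K = np.array([boltzmann(f, b) @ f for b in betas])          # Boltzmann utility
S = np.array([entropy(boltzmann(f, b)) for b in betas])     # entropy of q^beta

# Variance of f under the Boltzmann distribution, to compare with dK/dbeta.
q = boltzmann(f, 2.0)
variance = float(q @ f**2 - (q @ f) ** 2)
h = 1e-5
dK = float((boltzmann(f, 2.0 + h) @ f - boltzmann(f, 2.0 - h) @ f) / (2 * h))
```

On this grid `K` rises from the uniform average of `f` toward, but never reaches, `f.max()`, while `S` falls from ln 4; the finite-difference slope of `K` agrees with the variance of `f`.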

(4) We now extend the discussion to allow $q_i$ not to be $q_i^{\beta_i}$, the Boltzmann
distribution over values of $f_i$ with exponent $\beta_i$ (the distribution in Eq. (12)). This
extension will prove important in quantifying the rationality of a player based
solely on their mixed strategy and environment (Sec. 6.2 below).

First expand S(qi) for the case where in fact qi does equal the Boltzmann 
distribution q^ 1 . Then using Eg. 1131 we see that for this Boltzmann distribution 

14 Note that despite the terminology, the Boltzmann utility is not a "utility function" in the 
sense of a mapping from x to R. Rather it's what expected utility would be for a particular 
type of mixed strategy, in a particular environment, as a function of parameters of that mixed 
strategy. 

15 To see this, note that the variance is non-zero for all ft < oo, so long as fi(xi) is not a 
constant. Accordingly, under such circumstances -fiTi(ft) is invertible. 

16 To see this, say we replace the invariant $q_i \cdot f_i = K_i(\beta_i)$ with $q_i \cdot f_i \ge K_i(\beta_i)$. Then for
fixed $q_{-i}$, the MAP $q_i$ is the $q_i$ that maximizes $S(q_i)$ subject to that inequality constraint
that $q_i \cdot f_i \ge K_i(\beta_i)$. The entropy is a concave function of its argument, as is this inequality
constraint, so our problem is concave. Therefore the critical point of the associated Lagrangian
is the MAP $q_i$. Now if we increase $\beta_i$, and therefore increase $K_i$, the feasible region for our
new invariant shrinks. This means that when we do that the maximal feasible value of $S$
cannot increase. So the entropy of the critical point of the Lagrangian for our new invariant
cannot increase as $\beta_i$ does. However that critical point is just the Boltzmann distribution
$q_i(x_i) \propto \exp(\beta_i f_i(x_i))$, i.e., it is the MAP $q_i$ for the original equality invariant, $q_i \cdot f_i = K_i$. So
the property that increasing $\beta_i$ cannot increase the entropy must also hold for the original
equality invariant.
$$q_i \cdot f_i + \frac{S(q_i)}{\beta_i} = \frac{\ln[Z_{f_i}(\beta_i)]}{\beta_i}. \qquad (19)$$

Comparing with the corresponding equation in the appendix, we see that the quantity on the left-hand side
is essentially identical to the free energy of statistical physics.$^{17}$ Accordingly,
we call it the free utility of the player.

Note that the free utility is a function of $q_i$ and $\beta_i$, and is defined even when
$q_i$ is not a Boltzmann distribution over values of $f_i$. In contrast, the quantity
on the right-hand side of Eq. (19) is only a function of $\beta_i$. For fixed $\beta_i$, that
quantity on the right-hand side of Eq. (19) is an upper bound on the free utility
of the player.$^{18}$ For that fixed $\beta_i$, the free utility gap of player $i$ is defined
as the difference between its actual free utility and the maximum possible at
that $\beta_i$, $\ln[Z_{f_i}(\beta_i)] / \beta_i$. That gap is zero (player $i$'s free utility is maximized)
at player $i$'s associated equilibrium (Boltzmann) mixed strategy. Intuitively,
player $i$ "tries to" maximize free utility rather than expected utility, insofar as
it "tries to" achieve its MAP mixed strategy.
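The free utility and its gap can be illustrated with a small numerical sketch. The code below assumes a finite move set, a hypothetical environment `f`, and a hypothetical exponent `beta`; the sign convention for the gap (maximum minus actual, so that it is non-negative) is also an assumption made here for illustration.

```python
import math

def boltzmann(f, beta):
    """Return q(x) proportional to exp(beta * f(x))."""
    m = max(beta * v for v in f)
    w = [math.exp(beta * v - m) for v in f]
    z = sum(w)
    return [x / z for x in w]

def free_utility(q, f, beta):
    """Left-hand side of Eq. (19): q . f + S(q) / beta, for beta > 0."""
    s = -sum(p * math.log(p) for p in q if p > 0.0)
    return sum(p * v for p, v in zip(q, f)) + s / beta

def max_free_utility(f, beta):
    """Right-hand side of Eq. (19): ln Z(beta) / beta."""
    return math.log(sum(math.exp(beta * v) for v in f)) / beta

f = [0.0, 1.0, 3.0]          # hypothetical environment
beta = 2.0                   # hypothetical exponent
q_boltz = boltzmann(f, beta)
q_other = [0.2, 0.3, 0.5]    # an arbitrary non-Boltzmann mixed strategy

# Gap taken as (maximum possible) minus (actual), an assumed sign convention.
gap_boltz = max_free_utility(f, beta) - free_utility(q_boltz, f, beta)
gap_other = max_free_utility(f, beta) - free_utility(q_other, f, beta)
```

In this sketch `gap_boltz` vanishes up to rounding, while `gap_other` is strictly positive, which is the sense in which the Boltzmann mixed strategy maximizes free utility at fixed exponent.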

(5) Finally, we present a restriction that simplifies our discussion below of
how the support of $P(q \mid \mathscr{I})$ covers $\Delta_{\mathcal{X}}$ (Sec. 4.7 below).

We say that a particular $q$ is benign for utilities $\{u^i\}$ if for all players $i$, we
can write the associated expected utility $q_i \cdot U^i_{q_{-i}}$ as $K(U^i_{q_{-i}}, \beta_i)$ for a $\beta_i \ge 0$,
with $U^i_{q_{-i}}$ defined in terms of $u^i$ and $q_{-i}$ in the usual way. In this paper, for
simplicity we will only consider benign $q$'s. This means in particular that we
assume that for all players, their expected utility is not worse than the one
they would get for a uniform mixed strategy (which corresponds to $\beta_i = 0$).
While the analysis can be extended to allow negative $\beta_i$ (where player $i$ adopts
a worse-than-uniform mixed strategy), there is no need for such considerations
here.



3.3 Effective invariants and the QRE 

As discussed above, every product distribution $q$ can be specified by saying
that each of its marginalizations $q_i$ is an MAP prediction for some associated
"guessed environment" $f_i$ and $\beta_i$. But not every $q$ can be expressed this way
if we demand that the guessed environments $f_i$ for each player are their actual
environments. In other words, only for a subset of all $q$'s will it be the case that
$f_i(x_i) = U^i_{q_{-i}}(x_i) \;\forall i$.

Demanding such self-consistency in $q$ results in a coupled set of nonlinear
equations for $q$. This is the set of equations that specifies the QRE.$^{19}$ It

17 Free energy has a different sign on the entropy. This just reflects the fact that players 
work to raise utility whereas physical systems work to minimize energy. 

18 This follows from the fact that the $q_i$ that maximizes the free utility for our $\beta_i$ is just the
associated Boltzmann distribution.

19 In [23], $U^i_{q_{-i}}$ is called "a statistical reaction function", and the set of coupled equations
giving that solution is called the "logit equilibrium correspondence".
was first derived in this manner, as the self-consistent solution to a set of MAP
inferences, in [TT1I251I2T] . 

Note that there is no particular decision-theoretic significance to the QRE 
derived in this manner. In particular, it is not the Bayes-optimal solution to 
any inference problem. Nor is it derived as the MAP solution to any (single) 
problem. Rather it is given by a set of MAP solutions, each for a separate 
inference problem. There is one such problem for each separate player. We then
"tie together" those separate problems post hoc, by requiring that our solutions
to them are consistent with one another.

Unfortunately, such a two-stage process has no clear justification in terms 
of Cox's and Savage's axioms. More generally, it is hard to formally justify the 
approach of enforcing consistency among a set of separate inference problems 
rather than considering a single aggregate inference problem. (Recall that the 
inference is being done by the external scientist, and that that scientist is ex- 
ternal to the system.) To address that single inference problem, we must use a 
single invariant that concerns the entire joint system. 

In such an alternative to the QRE's post hoc "tying together", one analyzes
all players' distributions simultaneously, from the very beginning. This means 
that we analyze the distribution over full joint strategies that involve all the 
players. In this approach, to get any particular player i's distribution we would 
marginalize the distribution over joint strategies, rather than (as in the QRE) 
start with those marginal distributions and try to tie them together. 

The natural invariant for this aggregate inference problem is the "aggregate
invariant" that $q_i \cdot f_i = K_i \;\forall i$. However, as shown below, the QRE is not even
the MAP of the posterior over the space of joint strategies under this invariant
(never mind being Bayes-optimal). An analysis of this aggregate problem is the
subject of the next few sections. A discussion of the historical context of the
QRE can be found in an appendix.

4 Coupled players 

Recall that the posterior is given by the prior and the likelihood. Since (for 
both game types) we've chosen the prior, our next task is to set the ($\mathscr{I}$-based)
likelihood. We want that likelihood to have the same form as the likelihood 
underpinning the CE of statistical physics: a Heaviside theta function that re- 
stricts attention to a subset of all possible systems, with the distribution across 
that subset then set by our prior (see the appendix). However as elaborated 
below, the likelihood theta function appropriate for games is more complicated 
than the one that arises in statistical physics. This is because there are mul- 
tiple payoff functions in games, each with its own effect on the system's theta 
function, whereas there is only one Hamiltonian in a statistical physics system. 

In this section we illustrate how to set the likelihood in a scenario where the 
players may have knowingly interacted with each other before the current game. 
(In the next section we use these results to address the case where the players 
have not previously knowingly interacted.) In general those previous interactions
are allowed to vary from one instance to another; the invariant restricting
our instances will also be what restricts the possible previous interactions.

4.1 Invariants of human players 

In general of course, $\mathscr{I}$ does not specify everything about our inference problem,
and in particular it does not specify the value of that which we want to infer.
Here what we wish to infer is the actual joint strategy of the players. (See
Sec. 2.3.) So the joint strategy is not specified by $\mathscr{I}$. Therefore a player's
payoff is also not specified for any particular one of its moves, since that payoff
will depend on the moves of the other players in general; that payoff may vary
between instances.

Instead, here we stipulate that any player will try to maximize her expected
utility, to the best of her computational abilities, the best of her insights into
the other players and the game structure, etc.$^{20}$ Intuitively, this means we assume
a "pressure" embodied in the distribution over $q$'s, biasing the distribution
to have $q_i$ that achieve high values of $U^i_{q_{-i}} \cdot q_i$. This pressure is matched by
counterpressures from the other players affecting $U^i_{q_{-i}}$.

What we consider invariant is that from one instance to the next player i 
does not change, and therefore how insightful player i is into the other players 
(based on her previous interactions with them), how computationally powerful 
i is, etc., does not change. In other words, how well player i performs, in light 
of her (varying) environment of possible payoffs (i.e., in light of $U^i$) is the same
in all instances. In short, "how smart" every player i is does not change from 
one instance to the next. As in the case of statistical physics though, here our 
invariant need not specify precisely how smart each player is a priori, only that 
how smart each of them is doesn't vary from one instance to the next. 

As an example, consider the situation where the players knowingly are re- 
peatedly playing the game with each other, forming a sequence of games. Say 
we are considering the distribution over joint mixed strategies q at some fixed 
(invariant) sequence index t. In this scenario it is an entire sequence of games 
leading up to game t that constitutes "an instance of our inference problem" . 
We must determine what is invariant from one instance of that problem to 
another. 

Note that in game t of any instance the players' actual moves are indepen- 
dent, tautologically. (This is reflected in having q be a product distribution.) 
However in general the q at t will change from one instance to the next. In- 
deed, consider any time t' < t. At that time, in every sequence of games, each 
player modifies its mixed strategy based on the history of move-payoff pairs 
in that sequence for times previous to t', i.e., each player tries to learn what 
strategy is best based on its history and adapts its strategy accordingly. Since 
the move-payoff pairs of that history are formed by statistical sampling (of the 
joint mixed strategy), they will not be the same in all sequences. Accordingly, 
in general the modification i makes to its mixed strategy at t' will not be the 

20 This is not the case in situations like Allais' paradox; see below. 
same in all sequences. Therefore the final joint mixed strategy q will vary from
one sequence to the next. 

As a result of this sampling, across the set of all instances (i.e., all sequences)
there will be some statistical coupling between the time $t$ mixed strategies of
the separate players. This means in particular that in general the time $t$ MAP
$q$, $\mathrm{argmax}_{q^t} P(q^t \mid \text{invariant})$, is not a product of the individual time $t$ MAP $q_i$,

$$\prod_i \mathrm{argmax}_{q_i^t} P(q_i^t \mid \text{invariant}) = \prod_i \mathrm{argmax}_{q_i^t} \int dq_{-i}^t \, P(q_i^t, q_{-i}^t \mid \text{invariant}) \ne \mathrm{argmax}_{q^t} P(q^t \mid \text{invariant}) \qquad (20)$$

(This is in contrast to the case with independent players considered in Sec. 5.)

Since the final $q$ varies across the instances, in general we can't expect that
for each player $i$, its environment $U^i$ will be the same at the end of each sequence.
Indeed, consider even the case where play evolves to a Nash equilibrium at time
$t$. If the game has multiple such Nash equilibria, then in general which one
holds for a particular sequence of games will depend on the history of moves
and payoffs in that sequence. Accordingly, $U^i$ will depend on that sequence.

Formalizing all this means formalizing our invariant $\mathscr{I}$ of "how smart" a
player is. Here we consider how to do this for type I game theory, where inference
is of $q$. The discussion for type II game theory proceeds mutatis mutandis.

Consider just those instances of our inference problem in which player $i$
is confronted with some particular vector of move-conditioned expected utility
values, $U^i$. We say that $i$ is "as smart" in any one of those instances as
another if in each of them separately, on average, the move $i$ chooses has the
same payoff. In other words, how smart $i$ is is the same in all of those instances
if $i$'s expected utility, $q_i \cdot U^i$, has the same (potentially unknown) value in all
of them. We write that value as $\epsilon_i(U^i)$. As an example, at a Nash equilibrium
$\epsilon_i(U^i) = \max_{x_i} U^i(x_i)$.

Note how conservative this restriction on $q_i$ is. In particular, so long as
$\epsilon_i(U^i) < \max_{x_i} U^i(x_i) \;\forall i$, then we are guaranteed that multiple $q_i$ satisfy this
restriction. This is true even if the game has only a single Nash equilibrium.

Our invariant is simply that the functions $\{\epsilon_i\}$ are the same in all instances.
This invariant does not concern the joint choices (moves) of the players across the
instances (which is given by the $x$'s). Rather it concerns $q$, which is the physical
nature of the process driving the players to make those choices. However the
invariant does not specify that process. In particular it does not stipulate how
the players reason concerning each other. For example, it does not stipulate
how many levels of analysis of the sort "I know that you know that I know that
you prefer ..." any of the players go through (if any levels at all). All that
$\mathscr{I}$ stipulates is that certain high-level encapsulations of that decision-making,
given by the $\{\epsilon_i\}$, are the same in all instances.

As a result of this invariance, even though the moves $\{x_i\}$ of the players
are independent in any particular instance (since $q$ is a product distribution),
our (!) lack of knowledge concerning the set of all the instances might result in
a posterior $P(q \mid \mathscr{I})$ in which the distributions $\{q_i\}$ are statistically coupled.
(Recall that $q$ reflects the players, and $P$ reflects our inference concerning them.)

Note that for the entropic prior $P(q)$ there is no statistical coupling between
$x_i$ and $x_j$ in the prior distribution $P(x)$. (Recall that for that prior,
$P(x) = \int dq \, P(x \mid q) P(q)$ must be uniform, by symmetry.) However the potential
coupling between the $\{q_i\}$ means that in the posterior distribution, the
moves are not statistically independent (assuming one doesn't condition on $q$).
Formally,



$$P(x_i \mid \mathscr{I}) = \int dq_i \, P(x_i \mid q_i) P(q_i \mid \mathscr{I}) = \int dq_i \, q_i(x_i) P(q_i \mid \mathscr{I}) \qquad (21)$$

so

$$\prod_i P(x_i \mid \mathscr{I}) = \int dq \prod_i P(q_i \mid \mathscr{I}) \, q_i(x_i). \qquad (22)$$

On the other hand, recall that

$$P(x \mid \mathscr{I}) = \int dq \, P(q \mid \mathscr{I}) \prod_i q_i(x_i). \qquad (23)$$

So if $P(q \mid \mathscr{I})$ is not a product distribution, then in general $P(x \mid \mathscr{I}) \ne
\prod_i P(x_i \mid \mathscr{I})$, i.e., in this situation $P(x \mid \mathscr{I})$, which is the distribution
over joint moves reflecting our understanding of the system, is not a product
distribution either. In such a situation, to us, $x_i$ and $x_j$ are statistically coupled.

Such coupling also typically arises in the Bayes-optimal prediction for the
distribution over joint strategies. Indeed, say we adopt a quadratic loss function,
so that if we guess the joint distribution is $q''$, when in fact it is $q'$, our loss is
$(q'' - q')^2$. Then given the posterior $P(q \mid \mathscr{I})$, the associated Bayes-optimal prediction
for $q$, the prediction that minimizes our posterior expected quadratic loss,
is

$$p_{\text{quad}} = \int dq \, q \, P(q \mid \mathscr{I}). \qquad (24)$$

This is the same as the joint mixed strategy given by Eq. (23). (This is not the
case for other loss functions.) Accordingly, our conclusion about coupling of the
$\{x_i\}$ holds for this Bayes-optimal joint mixed strategy.$^{21}$
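A small numerical sketch may help make Eqs. (21) through (24) concrete. It assumes a hypothetical two-player, two-move game and a hypothetical posterior concentrated on just two product distributions with equal weight, so that the integral over $q$ becomes a sum; none of these numbers come from the analysis above.

```python
def joint(q1, q2):
    """Joint distribution over (x_1, x_2) for the product distribution q = q1 * q2."""
    return [[a * c for c in q2] for a in q1]

# Hypothetical discrete posterior P(q | I): (weight, joint distribution) pairs.
posterior = [
    (0.5, joint([0.9, 0.1], [0.9, 0.1])),
    (0.5, joint([0.1, 0.9], [0.1, 0.9])),
]

def posterior_mean(post):
    """Discrete analogue of Eq. (24): the weighted average of the q's."""
    rows, cols = len(post[0][1]), len(post[0][1][0])
    return [[sum(w * q[r][c] for w, q in post) for c in range(cols)]
            for r in range(rows)]

def expected_quadratic_loss(guess, post):
    """Posterior expected value of the squared distance between guess and q."""
    return sum(
        w * sum((guess[r][c] - q[r][c]) ** 2
                for r in range(len(guess)) for c in range(len(guess[0])))
        for w, q in post
    )

p_quad = posterior_mean(posterior)
marginal_1 = [sum(row) for row in p_quad]
marginal_2 = [sum(p_quad[r][c] for r in range(2)) for c in range(2)]
product_of_marginals = joint(marginal_1, marginal_2)
```

Here `p_quad[0][0]` is 0.41 while `product_of_marginals[0][0]` is 0.25, so the predicted joint distribution over moves is not a product distribution even though every $q$ in the support of the posterior is. The posterior mean also attains a lower expected quadratic loss than either of the two candidate $q$'s.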

Consider changing the cognitive process of some player $j \ne i$ in a way that does
not change $\epsilon_j$. Also do not change anything concerning all the other players. Do
all this in such a way that how that player $j$ chooses moves at time $t$ changes,
but nothing else changes about $j$'s behavior. So in particular, for any fixed
vector $U^i$, the $q_j$ that governs player $j$ at time $t$ and is consistent with that

21 Note the slight abuse of terminology; the moves of the players are statistically coupled
in this "joint mixed strategy", which is why we write that Bayes-optimal distribution
over $x$ not as $q$ but as $p$.
$U^i$ will change.$^{22}$ Now the distribution over possible $q_i$ at time $t$ is based on
behavior of player $i$ and of other players for times $t' < t$. Since those factors are
unchanged by our change to $q_j$ at time $t$, so is the distribution over possible $q_i$
then. Accordingly the change in $q_j$ will in general change the expected utility of
player $i$ at time $t$. In other words, changing player $j \ne i$ will in general change
$\epsilon_i$. This illustrates that our invariance is implicitly determined by the set of
players as a whole. This is in addition to its reflecting how the players have
interacted, the structure of the game, etc.$^{23}$

4.2 Specifying the function $\epsilon_i$

Now in general for any player $i$, our invariant doesn't force all instances to have
the same vector $U^i$. So to complete the quantification of how smart a player is
we need to specify the function $\epsilon_i$. To do this we use a Gedanken experiment; we
consider how player $i$ would behave in a counterfactual "game against Nature"
inference problem. In that new problem we focus on just one player $i$, fixing the
others. Formally, our invariant is expanded from that of the original problem,
to a new invariant $\mathscr{I}^i$ that also includes $U^i$. Since the invariant still stipulates
that $E(u^i) = \epsilon_i(U^i)$, having $U^i$ also invariant means that the expected utility
$E(u^i)$ does not change between instances of this new problem.

Write the (potentially unknown) value of that invariant expected utility as
$V_i$. Since we use the entropic prior over $q_i$, this new inference problem has the
usual entropic posterior. Also as usual, the MAP $q_i$ is given by a Boltzmann
distribution:

$$q_i^*(x_i) \propto e^{b_i U^i(x_i)} \qquad (25)$$

where the Lagrange parameter going into $b_i$ enforces the constraint, namely that
$q_i^* \cdot U^i = V_i$. (See Sec. 3.2 for the more general way that this constraint arises
and some useful equalities relating $b_i$, $V_i$, etc.)

We must now consider how $b_i$ changes as $U^i$ changes. The lowest order case
is where $b_i$ is a constant, independent of $U^i$. This means that for real-valued $b_i$,
$\epsilon_i$ is identical to the Boltzmann utility $K_i$ discussed in Sec. 3.2, with $U^i$ playing
the role that $f_i$ does in the definition of Boltzmann utility, and $b_i$ playing the
role of $\beta_i$. Just as we extend the domain of definition of $K_i(\cdot)$ to include $\infty$, we

22 We mean "consistent" in the sense that even though $q_{j \ne i}$ has changed, it is still true that
$$U^i(x_i) = \int dx'_{-i} \, u^i(x_i, x'_{-i}) \, q_{-i}(x'_{-i}) = \int dx'_j \, dx'_{-\{i,j\}} \, u^i(x_i, x'_j, x'_{-\{i,j\}}) \, q_j(x'_j) \, q_{-\{i,j\}}(x'_{-\{i,j\}}).$$

23 Note how problematic it would be to try to encapsulate our invariance in a traditional, non- 
Jaynesian "bath"-based approach to statistical physics. In such an approach, the invariants 
are the sums, across both the system under consideration and an external "bath" , of certain 
physical quantities. In other words, the aggregate amount of those quantities across the system 
and the external bath is taken to be conserved. For example, the CE arises if one takes the 
aggregate energy to be conserved. It is not at all clear how one could express our invariants 
as the values of such conserved quantities. 
do the same for $\epsilon_i$ and for $q_i^*$: for $b_i = \infty$, $q_i^*$ is the distribution that is uniform
over the set $\mathrm{argmax}_{x_i} U^i(x_i)$, and zero elsewhere.

Below we will use the shorthand $q^*(x) = \prod_i q_i^*(x_i)$, where for each $i$ the $U^i$
arising in the definition of $q_i^*$ is understood to be based on the $q^*_{-i}$ (i.e., each $U^i$
means $U^i_{q^*_{-i}}$). So the definition of $q^*$ reflects coupling between the players' mixed
strategies (though not necessarily between their moves): a change to $q_j^*$ for
some particular $j$ in general will modify the strategies $q^*_{i \ne j}$. $q^*$ is the Quantal
Response Equilibrium (QRE) solution, discussed in Sec. 3.2. In general, for
any particular game and $b$, there is at least one, and may be more than one,
associated $q^*$. This follows from Brouwer's fixed point theorem [23, 25].
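The self-consistency that defines $q^*$ can be sketched as a fixed-point computation. The code below assumes a hypothetical two-player, two-move game and uses damped fixed-point iteration, which is a simple numerical device chosen here for illustration and is not a method described in the text; it is not guaranteed to converge for arbitrary games or exponents. All function names are illustrative.

```python
import math

def softmax(values, b):
    """Boltzmann distribution proportional to exp(b * values[x])."""
    m = max(b * v for v in values)
    w = [math.exp(b * v - m) for v in values]
    z = sum(w)
    return [x / z for x in w]

def environments(u1, u2, q1, q2):
    """Move-conditioned expected utilities U^1(x_1) and U^2(x_2)."""
    U1 = [sum(u1[a][c] * q2[c] for c in range(len(q2))) for a in range(len(q1))]
    U2 = [sum(u2[a][c] * q1[a] for a in range(len(q1))) for c in range(len(q2))]
    return U1, U2

def logit_qre(u1, u2, b1, b2, damping=0.5, iters=2000):
    """Damped iteration of q_i <- Boltzmann(U^i_{q_-i}, b_i), starting from uniform."""
    q1 = [1.0 / len(u1)] * len(u1)
    q2 = [1.0 / len(u1[0])] * len(u1[0])
    for _ in range(iters):
        U1, U2 = environments(u1, u2, q1, q2)
        r1, r2 = softmax(U1, b1), softmax(U2, b2)
        q1 = [(1 - damping) * p + damping * r for p, r in zip(q1, r1)]
        q2 = [(1 - damping) * p + damping * r for p, r in zip(q2, r2)]
    return q1, q2

# Hypothetical payoff matrices, indexed [x_1][x_2].
u1 = [[2.0, 0.0], [0.0, 1.0]]
u2 = [[1.0, 0.0], [0.0, 2.0]]
q1, q2 = logit_qre(u1, u2, 1.0, 1.0)
```

At the returned `q1` and `q2`, each player's mixed strategy is (to numerical tolerance) the Boltzmann distribution for the environment induced by the other player's strategy, which is the coupling between mixed strategies described above.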

As a point of notation, the expression $\mathscr{I}_b$ is defined to be the invariant that
$\forall i, \; q_i \cdot U^i_{q_{-i}} = K(U^i_{q_{-i}}, b_i)$, where it is implicitly assumed that $b \ge 0$. For any
such $b$ there is always at least one $q$ that satisfies $\mathscr{I}_b$ (e.g., the QRE).

4.3 The impossibility of a Nash equilibrium 

Say that for the coupled players invariant, $\mathscr{I}$, (the support of) $P(q \mid \mathscr{I})$ is
restricted to the Nash equilibria of the underlying game, so that the players
are perfectly rational. (See Sec. 4.7 below.) Say that there are multiple such
equilibria, written $q^1, q^2, \ldots$, with $P(q \mid \mathscr{I}) = \sum_j a^j \delta(q - q^j)$. So the $a^j$ form
a probability distribution over the equilibria. Since the entropic prior extends
over all $q \in \Delta_{\mathcal{X}}$, in general none of the $a^j$ will equal zero exactly.

Since the equilibria are all product distributions, using Eq. (23) we can write

$$P(x \mid \mathscr{I}) = \sum_j a^j \prod_k q_k^j(x_k) \qquad (26)$$

so that

$$P(x_i \mid \mathscr{I}) = \int dx_{-i} \, P(x \mid \mathscr{I}) = \sum_j a^j q_i^j(x_i). \qquad (27)$$

Consider the case where the Nash equilibria are not exchangeable, so $P(q \mid \mathscr{I})$
is not a product distribution. This means that $P(x \mid \mathscr{I})$ is not a product
distribution in general, so that the players appear to be coupled, to us. (See the
discussion just below Eq. (23).)

At an intuitive level, such coupling is analogous to the consistency-among-players
coupling that underlies the concept of a Nash equilibrium. However
because it mixes the Nash equilibria with each other, in general the sum in Eq. (27)
is not a best-response mixed strategy for the product distribution $\prod_{j \ne i} P(x_j \mid \mathscr{I})$.
Formally, $p^i(x_i) = P(x_i \mid \mathscr{I})$ does not maximize the dot product

$$\int dx_i \, p^i(x_i) \left[ \int dx_{-i} \, u^i(x_i, x_{-i}) \prod_{j \ne i} P(x_j \mid \mathscr{I}) \right]. \qquad (28)$$

Similarly $P(x_i \mid \mathscr{I})$ is not a best-response mixed strategy for $P(x_{-i} \mid \mathscr{I})$. So
when the underlying game has multiple non-exchangeable equilibria, then even
if the players are perfectly rational, in general we will not predict distributions
governing the moves of the players that are best-response mixed strategies to
each other.
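This conclusion can be illustrated with a hypothetical battle-of-the-sexes game, which has two pure-strategy Nash equilibria that are not exchangeable. The payoff values and the equal weights on the two equilibria are assumptions chosen purely for illustration.

```python
# Hypothetical payoff matrices, indexed [x_1][x_2].
u1 = [[2.0, 0.0], [0.0, 1.0]]
u2 = [[1.0, 0.0], [0.0, 2.0]]

# The two pure Nash equilibria, each written as (q_1, q_2), and assumed weights a^j.
equilibria = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
a = [0.5, 0.5]

def predicted_marginal(player):
    """Eq. (27): P(x_i | I) = sum_j a^j q_i^j(x_i)."""
    return [sum(w * eq[player][x] for w, eq in zip(a, equilibria)) for x in range(2)]

p1 = predicted_marginal(0)
p2 = predicted_marginal(1)

# Player 1's move-conditioned expected utilities against the predicted marginal p2.
U1 = [sum(u1[x][y] * p2[y] for y in range(2)) for x in range(2)]
value_of_p1 = sum(p1[x] * U1[x] for x in range(2))   # the dot product of Eq. (28)
best_response_value = max(U1)
```

In this sketch the predicted marginal `p1` earns 0.75 against `p2`, whereas a best response would earn 1.0. So the predicted distribution over player 1's moves is not a best response to the predicted distribution over player 2's moves, even though the posterior is supported entirely on Nash equilibria.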

Note that this conclusion does not depend critically on our choice of $\epsilon_i$, or
even on our choice of encapsulating $\mathscr{I}$ in terms of such functions $\epsilon_i$. (After all,
we're explicitly allowing the case where $P(q \mid \mathscr{I})$ is restricted to Nash equilibria.)
Rather it comes from the fact that our prior allows non-zero probability
for all of the Nash equilibria.



4.4 The QRE and $\epsilon_i$

Unlike the usual motivation of the QRE, the motivation for our choice of $\epsilon_i$
does not say that $q_i$ must be a Boltzmann distribution. It does not say that the
probability distribution over possible $q_i$ is a delta function about a Boltzmann
distribution $q_i$. Rather it says that $q_i^*$, the most likely $q_i$ for the single-player
inference problem, is a Boltzmann distribution. It then uses that fact to motivate
a functional form for $\epsilon_i$ in the multi-player scenario. Here we only assume that
the relation between $E_q(u^i)$ and $U^i$ given by that functional form is consistent
with $q_i = q_i^*$. In general the invariant $E_q(u^i) = \epsilon_i(U^i; b_i)$ holds for many $q_i$ in
addition to the Boltzmann distribution.

Indeed, fix $q$, and consider any $i$ and the associated $U^i_{q_{-i}}$. Recall that we are
restricting attention to benign $q$'s (cf. the discussion at the end of Sec. 3.2). So
no matter what it is, our $q_i$ is consistent with our invariant for that $U^i_{q_{-i}}$, for
some $b_i$. Since this is true for all $i$, any $q$ is consistent with our full invariant
for some $b$. Furthermore, for any finite $\alpha$, the support of the entropic prior is
all $\Delta_{\mathcal{X}}$. This means that every $q$ has non-zero posterior probability $P(q \mid \mathscr{I}_b)$
for some $b$.

In contrast, not every $q_i$ is a Boltzmann distribution, i.e., not every $q_i$ is
part of a QRE. In other words, to assume a system is in a QRE is to make a
restrictive assumption about the physical system $q$, an assumption that may or
may not be correct. This is not the case with our invariant.

Finally, it turns out that the QRE can be viewed as an approximation to the
MAP prediction for our $\mathscr{I}$. A detailed discussion of this is presented in Sec. 4.6
below.



4.5 The MAP q 



Given our invariant, our likelihood is

$$P(\mathscr{I} \mid q) = \prod_i \delta\left(E_q(u^i) - \epsilon_i(U^i)\right). \qquad (29)$$

Recall that with the canonical ensemble the likelihood stipulates a linear constraint
on the underlying probability distribution. In contrast, due to the nonlinearity
of $\epsilon_i$, here the likelihood stipulates a nonlinear constraint on $q$.

As usual, if we wish we can distill the associated posterior into a single
prediction for $q$, e.g., into the MAP estimate. Naively, one might presume that
$q^*$ is that MAP estimate. After all, $q^*$ respects our constraints that $E_q(u^i) =
\epsilon_i(U^i_{q_{-i}}) \;\forall i$, and it maximizes the entropy of each player's strategy considered in
isolation from the others. However in general $q^*$ will not maximize the entropy
of the joint mixed strategy subject to our constraints. In other words, while
$q^*$ is MAP for each individual player's strategy, in general it is not MAP for the
joint strategy of all the players. The reason is that setting each separate $q_i$ to
maximize the associated entropy (subject to having $q$ obey our invariant), in a
sequence, one after the other, will not in general result in a $q$ that maximizes
the sum of those entropies. So it will not in general result in a $q$ that maximizes
the entropy of the joint system.

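Proceeding more carefully, the MAP estimate of the mixed strategy $q$ is
given by the critical point of the Lagrangian

$$\mathscr{L}(q, \{\lambda_i\}) = S(q) + \sum_i \lambda_i \left( q_i \cdot U^i - \epsilon_i(U^i) \right) \qquad (30)$$

where the $\lambda_i$ are the Lagrange parameters enforcing the constraints provided
by the likelihood function of Eq. (29). The critical point of this Lagrangian must
satisfy

$$\begin{aligned}
0 &= \frac{\partial \mathscr{L}}{\partial q_i(x_i)} \\
&= -1 - \ln[q_i(x_i)] + \lambda_i E(u^i \mid x_i) + \sum_{j \ne i} \lambda_j \left[ E(u^j \mid x_i) - \frac{\partial \epsilon_j(U^j)}{\partial q_i(x_i)} \right] \\
&= -1 - \ln[q_i(x_i)] + \lambda_i E(u^i \mid x_i) + \sum_{j \ne i} \lambda_j \left[ E(u^j \mid x_i) - \int dx_j \, \frac{\partial \epsilon_j(U^j)}{\partial U^j(x_j)} \frac{\partial U^j(x_j)}{\partial q_i(x_i)} \right] \\
&= -1 - \ln[q_i(x_i)] + \lambda_i E(u^i \mid x_i) + \sum_{j \ne i} \lambda_j \left[ E(u^j \mid x_i) - \int dx_j \, \frac{\partial \epsilon_j(U^j)}{\partial U^j(x_j)} E(u^j \mid x_i, x_j) \right]. \qquad (31)
\end{aligned}$$

Accordingly, at the MAP solution, for all players $i$,

$$q_i(x_i) \propto \exp\left\{ \lambda_i E(u^i \mid x_i) + \sum_{j \ne i} \lambda_j \left[ E(u^j \mid x_i) - \int dx_j \, \frac{\partial \epsilon_j(U^j)}{\partial U^j(x_j)} E_q(u^j \mid x_i, x_j) \right] \right\}. \qquad (32)$$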

This is a set of coupled nonlinear equations. The solution will depend on
the functional form of each $\epsilon_j$. The form being investigated here is Boltzmann
utility functions, so we must plug that form in to evaluate $\partial \epsilon_j(U^j) / \partial U^j(x_j)$. After
doing that, interchange the order of the two differentiations, to differentiate with
respect to $U^j(x_j)$ before differentiating with respect to $b_j$. Carrying through
the algebra one gets

$$\frac{\partial \epsilon_j(U^j)}{\partial U^j(x_j)} = q_j^*(x_j) \left[ 1 + b_j \{ E_{q_{-j}}(u^j \mid x_j) - E_{q_j^*, q_{-j}}(u^j) \} \right]. \qquad (33)$$

We must now plug this into the integrals occurring in Eqs. (31) and (32). Each
such integral becomes

$$\int dx_j \, q_j^*(x_j) E_q(u^j \mid x_i, x_j) \left[ 1 + b_j \{ E_{q_{-j}}(u^j \mid x_j) - E_{q_j^*, q_{-j}}(u^j) \} \right]. \qquad (34)$$

Together with the constraints $\{E_q(u^j) = \epsilon_j(U^j)\}$, Eq. (32) now gives us a set
of coupled nonlinear equations for the parameters $\{\lambda_j\}$ and the $q_j$. The solution
to this set of equations gives our MAP $q$.
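The derivative in Eq. (33) can be checked against finite differences. The sketch below treats the environment as a fixed vector `U` over a finite move set, with `U[x]` playing the role of $E_{q_{-j}}(u^j \mid x_j)$ and the Boltzmann utility playing the role of $E_{q_j^*, q_{-j}}(u^j)$; the vector, the exponent `b`, and the function names are hypothetical values chosen for illustration.

```python
import math

def boltzmann(U, b):
    """q*(x) proportional to exp(b * U(x))."""
    m = max(b * v for v in U)
    w = [math.exp(b * v - m) for v in U]
    z = sum(w)
    return [x / z for x in w]

def boltzmann_utility(U, b):
    """K(U, b): expected value of U under the Boltzmann distribution q*."""
    return sum(p * v for p, v in zip(boltzmann(U, b), U))

def analytic_gradient(U, b):
    """Right-hand side of Eq. (33): q*(x) [1 + b (U(x) - K)]."""
    q = boltzmann(U, b)
    K = sum(p * v for p, v in zip(q, U))
    return [p * (1.0 + b * (v - K)) for p, v in zip(q, U)]

def numeric_gradient(U, b, h=1e-6):
    """Central finite-difference estimate of dK / dU(x) for each move x."""
    grad = []
    for k in range(len(U)):
        up = list(U)
        up[k] += h
        down = list(U)
        down[k] -= h
        grad.append((boltzmann_utility(up, b) - boltzmann_utility(down, b)) / (2.0 * h))
    return grad

U = [0.0, 1.0, 3.0]   # hypothetical move-conditioned expected utilities
b = 1.5               # hypothetical exponent
max_error = max(abs(a - n) for a, n in zip(analytic_gradient(U, b), numeric_gradient(U, b)))
```

For these inputs the analytic and numerical gradients agree to well within the finite-difference error, and the components of the analytic gradient sum to one, as expected since adding a constant to every entry of `U` raises the Boltzmann utility by that same constant.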



4.6 The relation between the MAP q and the QRE 

Ultimately the only free parameters in our solution for the MAP $q$ are $b$. The
QRE solution $q^*$ is also a set of coupled nonlinear equations parameterized by
$b$. In general there is a very complicated relation between the MAP $q(x)$ and
$q^*(x)$, one that varies with $b$ (as well as with the $\{u^j\}$, of course). In particular,
in general the two solutions differ.

Intuitively, the reason for the difference between the two solutions is that
each player $i$ does not operate in a fixed environment, but rather in one containing
intelligent players trying to adapt their moves to take into account $i$'s
moves. This is embodied in the likelihood of Eq. (29). In contrast to that likelihood,
the likelihoods of the QRE each implicitly assume that the associated
player $i$ operates in a fixed environment.

Formally, if we make a change to $q_i$, then the likelihood of Eq. (29) will induce a
change to $q_{-i}$ to have the invariant for the players other than $i$ still be satisfied.
This change to $q_{-i}$ will then induce a "second order" follow-on change to $q_i$, to
satisfy the invariant for player $i$. This second-order effect will not arise in the
likelihood associated with the QRE $q^*$, which treats the other players as fixed.

Note that with the likelihood of Eq. (29) the second-order effect will induce a
further change to $q_{-i}$, to ensure the invariant is still satisfied, which will then
cause a third-order change to $q_i$, and so on. This back-and-forth is a direct
mathematical manifestation of the "I know that you know that I know that you
prefer ..." feature at the core of game theory. This is the phenomenon that
distinguishes game theory as a subject from decision theory. The difference
between the QRE and the MAP $q$ is an encapsulation of this distinguishing
feature.
There are other ways to view the intuitive nature of the relationship between
the QRE and the MAP q. For example, in deriving the MAP q one
follows standard probability theory and multiplies likelihoods concerning the 
separate players to get a likelihood concerning the full joint system. The mode 
of (the product of the prior and) that joint likelihood gives the single most likely 
solution to our inference problem. In contrast, the QRE q starts by separately 
finding the most likely solutions to each of many different inference problems 
(one problem for each player). It then multiplies those solutions concerning 
different problems together. It is not apparent what justifying formal argument 
(i.e., one based on Savage's axioms) there is for taking that product of solutions 
of different problems as one's guess for the solution to a single joint problem. 

The mathematical relationship between the QRE and the MAP $q$ is a complicated
one. Here we consider the simplifying approximation that under the
integral of Eq. (32) we can equate $q(x)$ and $q^*(x)$. In other words, assume we
can use the mean-field approximation within integrands. Exploiting this, we
can evaluate the integral in Eq. (32):

$$\begin{aligned}
\int dx_j \, \frac{\partial \epsilon_j(U^j)}{\partial U^j(x_j)} E_q(u^j \mid x_i, x_j)
&= \int dx_j \, q_j^*(x_j) E_{q^*}(u^j \mid x_i, x_j) \left[ 1 + b_j \{ E_{q^*}(u^j \mid x_j) - E_{q^*}(u^j) \} \right] \\
&= E_q(u^j \mid x_i) - b_j \left[ E_{q^*}(u^j) E_{q^*}(u^j \mid x_i) - \int dx_j \, q_j^*(x_j) E_{q^*}(u^j \mid x_j) E_{q^*}(u^j \mid x_j, x_i) \right] \qquad (35)
\end{aligned}$$

where we have used the fact that $q$ is a product distribution. Plugging this into
Eq. (31) gives

$$0 = -1 - \ln[q_i(x_i)] + \lambda_i E_q(u^i \mid x_i) + \sum_{j \ne i} \lambda_j b_j \left[ E_{q^*}(u^j) E_{q^*}(u^j \mid x_i) - \int dx_j \, q_j^*(x_j) E_{q^*}(u^j \mid x_j) E_{q^*}(u^j \mid x_j, x_i) \right] \qquad (36)$$

as our equation for $q_i$ in terms of $q_{-i}$ and $q^*$.
So consider the situation where for all $j$,

$$E_{q^*}(u^j) E_{q^*}(u^j \mid x_i) = \int dx_j \, q_j^*(x_j) E_{q^*}(u^j \mid x_j) E_{q^*}(u^j \mid x_j, x_i). \qquad (37)$$

In this situation, in light of Eq. (32), we recover for the MAP $q$ the very QRE
solution that we assumed when we made the mean-field approximation, where
$b_i = \lambda_i \;\forall i$. Accordingly, if the QRE solution obeys Eq. (37) it is an MAP solution.
If the QRE only approximately obeys Eq. (37) then the exact MAP solution can
be found by expanding around the QRE via Eq. (32).

The difference between the two sides of Eq. (37) is a covariance, evaluated
according to $q^*$, between the random variables $E_{q^*}(u^j \mid x_j, x_i)$ and $E_{q^*}(u^j \mid x_j)$.$^{24}$
Comparing Eqs. (34) and (32), this provides the following result concerning
our mean-field approximation:

Theorem 1: The QRE $q^*$ is the MAP of $P(q \mid \mathscr{I})$ with the vector equality
$\lambda = b$ iff $\forall i$,

$$\sum_{j \ne i} (b_j)^2 \, \mathrm{Cov}_{q_j^*}\left[ E_{q^*}(u^j \mid x_j, x_i), \, E_{q^*}(u^j \mid x_j) \right]$$

is independent of $x_i$, where Cov is the covariance operator:

$$\mathrm{Cov}_p[a(y), b(y)] = \int dy \, p(y) a(y) b(y) - \int dy \, p(y) a(y) \int dy \, p(y) b(y).$$

Particularly for very large systems (e.g., a human economy), it may be that
$E_{q^*}(u^j \mid x_j, x_i) = E_{q^*}(u^j \mid x_j)$ for almost any $i, j$ and associated moves $x_i, x_j$. In
this situation the move of almost any player $i$ has no effect on how the expected
payoff to player $j$ depends on $j$'s move. If this is in fact the case for player $i$ and
all other players $j$, then the covariance for each $j, x_j, x_i$ reduces to the variance
of $E_{q^*}(u^j \mid x_j)$ as one varies $x_j$ according to $q^*$.
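The quantity appearing in Thm. 1 can be evaluated directly in small games. The sketch below assumes two players, so that the sum over $j \ne i$ has a single term and $E(u^j \mid x_j, x_i)$ is simply the payoff $u^j(x_i, x_j)$. The mixed strategies passed in are hypothetical placeholders; to apply the theorem itself they would have to be the QRE strategies for the game in question.

```python
def thm1_terms(u_j, q_i, q_j, b_j):
    """Return (b_j)^2 * Cov_{q_j}[u_j(x_i, x_j), E(u_j | x_j)] for each move x_i.

    u_j is player j's payoff matrix indexed [x_i][x_j].
    """
    n_i, n_j = len(q_i), len(q_j)
    # E(u^j | x_j): average player j's payoff over player i's move.
    cond_j = [sum(q_i[a] * u_j[a][c] for a in range(n_i)) for c in range(n_j)]
    mean_cond = sum(q_j[c] * cond_j[c] for c in range(n_j))
    terms = []
    for a in range(n_i):
        mean_a = sum(q_j[c] * u_j[a][c] for c in range(n_j))              # E(u^j | x_i)
        cross = sum(q_j[c] * u_j[a][c] * cond_j[c] for c in range(n_j))   # E of the product
        terms.append(b_j ** 2 * (cross - mean_a * mean_cond))
    return terms

# Decoupled case: player j's payoff does not depend on player i's move.
decoupled = thm1_terms([[1.0, 3.0], [1.0, 3.0]], [0.3, 0.7], [0.4, 0.6], 2.0)

# Coupled case: player j's payoff depends on both moves.
coupled = thm1_terms([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.5], [0.5, 0.5], 2.0)
```

In the decoupled case the two entries coincide, so the expression is independent of $x_i$ and reduces to $(b_j)^2$ times a variance. In the coupled case the entries differ across $x_i$, which is the situation in which the condition of Thm. 1 fails.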

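This variance is given by the partition function:

$$\mathrm{Var}_{q_j^*}\left( E_{q^*}(u^j \mid x_j) \right) = \mathrm{Var}_{q_j^*}(U^j) = \frac{\partial^2 \ln[Z_{U^j}(b_j)]}{\partial b_j^2}. \qquad (38)$$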



In particular, for $b_j \to \infty$ (perfectly rational behavior on the part of agent
$j$) the variance goes to 0. So if every $i$ is "decoupled" from all other agents,
then in the limit that all such agents become perfectly rational, the expression
in Thm. 1 generically goes to 0. (The $b_j$-dependence in the covariance occurs
in an exponent, and therefore generically overpowers the $(b_j)^2$ multiplicative
factor.) So the QRE approaches the MAP solution in that situation.
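
The relation between the variance and the partition function can be checked numerically. The following is a minimal sketch, assuming a finite move space and a Boltzmann distribution $q(x) \propto \exp(b\, U(x))$; the payoff vector `U` and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def boltzmann(U, b):
    """Boltzmann (logit) distribution q(x) proportional to exp(b * U(x))."""
    w = np.exp(b * (U - U.max()))  # subtract the max for numerical stability
    return w / w.sum()

def log_partition(U, b):
    """ln Z(b) with Z(b) = sum_x exp(b * U(x)), computed stably."""
    m = (b * U).max()
    return m + np.log(np.exp(b * U - m).sum())

def variance(U, b):
    """Variance of U(x) when x is drawn from the Boltzmann distribution."""
    q = boltzmann(U, b)
    mean = q @ U
    return q @ (U - mean) ** 2

def second_derivative(U, b, h=1e-3):
    """Central finite-difference estimate of d^2 ln Z / db^2."""
    return (log_partition(U, b + h) - 2 * log_partition(U, b)
            + log_partition(U, b - h)) / h ** 2

# Hypothetical expected payoffs E(u^j | x_j) for four moves of player j.
U = np.array([1.0, 0.5, -0.2, 0.9])

for b in (0.5, 2.0, 10.0, 200.0):
    print(b, variance(U, b), second_derivative(U, b))
```

The two printed columns should agree up to finite-difference error, and both shrink toward 0 as `b` grows, which is the behavior the argument above relies on.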

On the other hand, if the players have bounded rationality, their variances
are nonzero. In this case the expression in Thm. 1 is nonzero for each $i$, $x_i$.
Typically for fixed $i$ the precise nonzero value of that variance will vary with $x_i$.
In this case, by Thm. 1, we know that the QRE differs from the MAP solution.

There are many special game structures (e.g., zero-sum games) in which one 
can make some arguments about the likely form of the sum in Thm. 1. An 
elaboration of those arguments is the subject of ongoing research. 

24 Note that the second random variable is just the average (according to $q^*$) of the first
one. So we can rewrite the covariance another way, as a covariance evaluated according to
$q^*_j(x'_j)\, q^*_i(x'_i)$, between the random variables $E_{q^*}(u^j \mid x'_j, x_i)$ and $E_{q^*}(u^j \mid x'_j, x'_i)$.
4.7 The posterior q covers all Nash equilibria

Not all q can be cast as a QRE for some appropriate b. So in particular, 
a q that occurs in the real world will in general differ, even if only slightly, 
from all possible QRE's. This can be viewed as a shortcoming of the QRE (a 
shortcoming that applies to all equilibrium concepts with a sufficiently small 
number of parameters). 

Now as $b \to \infty$, the QRE reduces to some mixed strategy Nash equilibrium.
Different sequences of the $b$ going to the infinity vector can lead to different
Nash equilibria. However in general starting from the point where all $b_j = 0$
and continuously increasing the components of $b$ can only lead to one particular
equilibrium, and other Nash equilibria are not the limit of such a sequence [23].
This too can be viewed as a shortcoming of the QRE.
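
To make the branch-following behavior concrete, here is a small sketch that traces a logit QRE of a two-player game as $b$ is increased from 0. It is an illustration under stated assumptions, not the procedure of [23]: the payoff matrices are hypothetical, both players share a single exponent $b$, and the fixed point is found by damped iteration warm-started from the previous value of $b$.

```python
import numpy as np

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def logit_qre(A, B, b, p0, q0, iters=500, damp=0.5):
    """Damped fixed-point iteration for a two-player logit QRE.

    A[x1, x2] and B[x1, x2] are the payoffs of players 1 and 2.
    b is a common rationality exponent (an assumption; the text allows a vector b).
    """
    p, q = p0.copy(), q0.copy()
    for _ in range(iters):
        p_new = softmax(b * (A @ q))      # player 1 responds to environment A q
        q_new = softmax(b * (B.T @ p))    # player 2 responds to environment B^T p
        p = damp * p + (1 - damp) * p_new
        q = damp * q + (1 - damp) * q_new
    return p, q

# Hypothetical coordination game with pure Nash equilibria (0, 0) and (1, 1).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = A.copy()

p = q = np.array([0.5, 0.5])              # the b = 0 QRE is uniform
path = []
for b in np.linspace(0.0, 20.0, 81):      # increase b gradually, warm-starting
    p, q = logit_qre(A, B, b, p, q)
    path.append((b, p[0], q[0]))

# A different starting point at large b settles near the other pure equilibrium.
p_alt, q_alt = logit_qre(A, B, 20.0, np.array([0.0, 1.0]), np.array([0.0, 1.0]))
```

In this toy game the path started at $b = 0$ ends near the equilibrium (0, 0), while the equilibrium (1, 1) is only approached from a different starting point, which illustrates the selection effect described above.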

However from the perspective of PGT, there is far more to the posterior 
distribution specified by a particular vector b than some single q chosen using 
that posterior, be that q the associated Bayes-optimal q, the MAP q, or an 
approximation to the MAP q like the QRE. In this, the potential impossibility 
of one particular sequence of such q's approaching some particular one of the
game's Nash equilibria is not necessarily a reason for concern. 

To formalize this we start with the following result: 

Proposition 1: Define $Q(b) = \{q \in \Delta_X : \forall i,\ P(q \mid \mathscr{I}_{b'}) > 0 \text{ for some } b' \succeq b\}$.
Let $B$ be some sequence of $b$ values that converges to $\infty$, i.e., such that for all
$b \in B$ having no infinite components, $\exists\, b' \in B$ where $b' \succ b$. Then all members
of $\cap_{b \in B}\, Q(b)$ are Nash equilibria of the game.

Proof: Hypothesize $\exists\, q \in \cap_{b \in B}\, Q(b)$ which is not a Nash equilibrium. Then
$\exists\, i$ such that $U^i_{q_{-i}}$ is not constant valued. In addition, we know that
$q_i \cdot U^i_{q_{-i}} \equiv v_i < \max_{x_i} U^i_{q_{-i}}(x_i)$.
However recall from Sec. 3.2 that if $U^i_{q_{-i}}$ is not constant-valued,
the Boltzmann utility $K(U^i_{q_{-i}}, \cdot)$ is a monotonically increasing bijection with
domain $[0, \infty)$ and range
$\left[ \frac{\int dx_i\, U^i_{q_{-i}}(x_i)}{\int dx_i\, 1},\ \max_{x_i} U^i_{q_{-i}}(x_i) \right)$.
Since $v_i$ falls within that range, this means that we can invert $K(U^i_{q_{-i}}, \cdot)$
to get a unique finite value $\hat{b}_i$ that is consistent with $q$. Accordingly,
$P(\mathscr{I}_b \mid q)$ is non-zero only if $b_i = \hat{b}_i$, and therefore so is $P(q \mid \mathscr{I}_b)$.

However by definition $q$ must be a member of $Q(b)$ for all $b$ in the limiting
sequence. That means in particular that it must be a member of $Q(b')$ for some
$b'$ where $b'_i > \hat{b}_i$. But by definition, all members $q$ of $Q(b')$ have
$P(q \mid \mathscr{I}_{\tilde{b}}) > 0$ for some $\tilde{b}$ such that $\tilde{b}_i \ge b'_i > \hat{b}_i$.
Since we know $P(q \mid \mathscr{I}_b)$ is non-zero only if $b_i = \hat{b}_i$, this means that
$q \notin Q(b')$, contrary to hypothesis. QED.

Conversely, every q has a non-infinitesimal posterior probability (density), 
for some (potentially infinite) b that specifies that posterior. More formally, 
Proposition 2: For any benign $q \in \Delta_X$ there is a unique $b$ and associated
invariant $\mathscr{I}_b$ such that $P(q \mid \mathscr{I}_b) \neq 0$. For that $b$, for all $q' \in \Delta_X$,

$$ \frac{P(q \mid \mathscr{I}_b)}{P(q' \mid \mathscr{I}_b)} \ge |X|^{-\alpha} \qquad (39) $$

where $\alpha$ is the exponent of the entropic prior.

Proof: First recall that any $q$ has non-zero posterior probability $P(q \mid \mathscr{I}_b)$ for
some $b$, assuming finite entropic prior constant $\alpha$. (See Sec. 4.3.) So to prove
the first part of the proposition we must establish the uniqueness of that $b$.

Consider any $i$ and the given $q$. Say $q_i \cdot U^i_{q_{-i}} \equiv v_i \neq \max_{x_i} U^i_{q_{-i}}(x_i)$.
This means that $U^i_{q_{-i}}$ is not the constant function that is independent of its argument.
Now recall from Sec. 3.2 that for any such $v_i$ and fixed non-constant $U^i$, there
is always a unique $b_i \in [0, \infty)$ such that $K_i(b_i)$ equals $v_i$. On the other hand,
as explained in the discussion in that subsection, if $v_i = \max_{x_i} U^i_{q_{-i}}(x_i)$, then
regardless of whether $U^i_{q_{-i}}$ varies with its argument, $b_i = \infty$. So there is a unique
$b_i$ consistent with $q$, which we write as $b^*_i$. Since this holds simultaneously for
all $i$, the entire vector $b^*$ with components $\{b^*_i\}$ is unique.

This means that the likelihood $P(\mathscr{I}_{b^*} \mid q) = 1$. On the other hand,
$P(\mathscr{I}_{b^*} \mid q') \le 1$ for any $q'$. Accordingly, the ratio in the proposition is bounded below
by the ratio of the exponential prior at $q$ to that at $q'$. However the ratio of
$e^{\alpha S(q'')}$ between any two points $q''$ is bounded below by

$$ \frac{\exp(\alpha \cdot 0)}{\exp(\alpha \ln(|X|))} = |X|^{-\alpha}. $$

QED
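
The last step of the proof uses only the fact that the Shannon entropy of a distribution over $|X|$ outcomes lies between 0 and $\ln |X|$. A quick numerical sanity check of the resulting bound on the prior ratio, assuming a finite $X$ and hypothetical values for $|X|$ and $\alpha$, might look as follows.

```python
import numpy as np

def entropy(q):
    """Shannon entropy S(q) in nats, with 0 ln 0 taken as 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return -np.sum(q[nz] * np.log(q[nz]))

def prior_ratio(q, q_prime, alpha):
    """Ratio exp(alpha S(q)) / exp(alpha S(q')) of unnormalized entropic priors."""
    return np.exp(alpha * (entropy(q) - entropy(q_prime)))

rng = np.random.default_rng(0)
n_moves, alpha = 4, 1.5                 # hypothetical |X| and prior exponent
bound = n_moves ** (-alpha)             # the lower bound |X|^(-alpha)

# Random pairs of distributions, sampled uniformly from the simplex.
samples = rng.dirichlet(np.ones(n_moves), size=200)
ratios = [prior_ratio(q, qp, alpha) for q in samples[:100] for qp in samples[100:]]

# The bound is attained for a pure q compared against the uniform q'.
worst = prior_ratio([1, 0, 0, 0], [0.25] * 4, alpha)
print(min(ratios), worst, bound)
```

Every sampled ratio stays above the bound, and the pure-versus-uniform pair sits exactly on it.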

In particular, this result holds for Nash equilibrium $q$; such equilibria arise for
$b = \infty$. The relative probabilities of those Nash $q$ are given by the ratios of
the associated prior probabilities, i.e., by (the exponential of) the associated
entropies, $S(q)$. This reflects our presumption that it is a priori more likely
that the adaptation/learning processes that couple the players result in a Nash
equilibrium with broad $q$ than that they result (for example) in a "golf hole" pure
strategy $q$. (Generically, such golf hole solutions are more difficult to find for
any broadly applicable learning process.)

Prop. 2 also holds for any particular q infinitesimally close to one of the 
Nash equilibria. In this sense, the posterior probability is arbitrarily tightly 
restricted to any one of the Nash equilibria for some appropriate b. 

The picture that emerges then is that $\forall b$, $\exists$ a proper submanifold of $\Delta_X$ that
is the support of the posterior. There is no overlap between those submanifolds
(one for each $b$), and their union is all of $\Delta_X$, including the Nash equilibria
$q$'s (for which $b = \infty$). Those Nash equilibria are the limit points of those
submanifolds (in a sequence of increasing $b$).

Within any single one of the submanifolds no $q$ has too small a posterior (cf.
Prop. 2). This is because all $q$ within a single submanifold have the same value
(namely 1) of their likelihoods. Accordingly, the ratios of the posteriors of the
$q$'s within the submanifold are given by the ratios of (the exponentials of) the
entropies of those $q$'s.
Finally, consider the case that the submanifolds get a unique maximum as
$b \to \infty$. This means that the mode of the posterior (the MAP $q$) necessarily
goes to a single one of the Nash equilibria in that limit. In this sense, "only
one Nash equilibrium is picked out by that limit". In particular, this limiting
behavior holds for the QRE approximation to the MAP $q$. As mentioned, this
has been seen as a problematic aspect of the QRE equilibrium concept. However
from the perspective of PGT there is nothing untoward about this behavior. After
all, all of the Nash equilibria have non-zero posterior in that limit (cf. Prop. 2);
it just so happens that the QRE ends up at a single one of those equilibria.

4.8 Alternative choices of $\mathscr{I}$

Of course, one can always design "learning" algorithms for players to follow in 
such a way that our assumed invariants don't hold. After all, in the extreme 
case you can design "learning" algorithms that are intentionally stupid, giving 
higher probability to moves with lower expected utility. Less trivially, there are 
many algorithms that are of interest in the game theory community even though
they would never be considered by anyone in the machine learning community
applying learning algorithms to real-world problems (e.g., fictitious play). It
may well be that such algorithms don't obey the assumed invariants exactly for
some $\{U^i\}$.

However this issue also obtains, at least as strongly, for alternative encapsu- 
lations of rationality like Nash equilibrium, trembling hand, quantal response, 
etc. It is trivial to design "learning algorithms" that guarantee that those equi- 
libria cannot arise. More generally, in all statistical inference — in other words, 
in all of science — any formalization of invariants may well have some error. This 
is even true in statistical physics, and is an intrinsic feature to any predictive 
science. 

All of this notwithstanding, there are a number of alternative choices of $\mathscr{I}$ to
the one considered here that should be investigated in depth. To give a simple
example, the $\mathscr{I}$ considered here is only a "lowest order" choice for an invariant.
In particular, as mentioned above, our choice of $\mathscr{I}$ assumes $b_i$ is independent
of $U^i$. A more sophisticated analysis than can be fully developed here would
consider possible couplings between $b_i$'s and $U^i$'s; in this paper $b_i$'s and $U^i$'s are
independent.

However one does not need to couple $b_i$'s and $U^i$'s to get reasonable alternative
choices of $\mathscr{I}$. For example, due to the $U^i$-independence of $b_i$ with our
$\mathscr{I}$, whatever $b_i$ is, for some $q$ the associated likelihood $P(\mathscr{I} \mid q) = 0$ (just
like whatever the temperature of a physical system, some phase space distributions
are incompatible with that temperature). To avoid this, in many scenarios
we might want to allow how smart a player $i$ is to vary from one instance to
the next, even without considering detailed mathematical structures relating
variations in $b_i$ with those in $U^i$.

One way to do this would mean allowing $b_i$ to vary in a $U^i$-independent
manner, with only its average value fixed. 25 This simple generalization of $\mathscr{I}$ can
be accommodated by switching the analysis to involve type II games. Although
the details of that analysis (like all other details of type II game theory) are
beyond the scope of this paper, it is worth making some broad comments on it.

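That type II analysis starts by extending the definition of an environment
vector to type II games in the obvious way: indexed by $q'_i$, the type II environment
is defined by

$$ U^i_{\pi_{-i}}(q'_i) \triangleq \int dq'_{-i}\, \pi_{-i}(q'_{-i})\, E_{q'_i, q'_{-i}}(u^i) \qquad (40) $$

so that the expected value of $u^i$ is given by

$$ \pi_i \cdot U^i_{\pi_{-i}} = \int dq'_i\, \pi_i(q'_i)\, U^i_{\pi_{-i}}(q'_i) = E_{\pi}(u^i). \qquad (41) $$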

The analysis also extends the definition of $K(\cdot, \cdot)$ to type II games in the obvious
way: $K(U^i_{\pi_{-i}}, B_i)$ is what $\pi_i \cdot U^i_{\pi_{-i}}$ would be if $\pi_i$ were the associated Boltzmann
distribution, $\pi_i(q'_i) \propto \exp(B_i\, U^i_{\pi_{-i}}(q'_i))$.

The new invariant would then be that for all $i$,

$$ \pi_i \cdot U^i_{\pi_{-i}} = K(U^i_{\pi_{-i}}, B_i). \qquad (42) $$

This invariant allows any $q_i$ to occur, thereby allaying the potential shortcoming
with the invariant this paper focuses on. It is now certain $\pi_i$ that are excluded
rather than certain $q_i$.

To motivate our next alternative for the invariant $\mathscr{I}$, consider the distribution
induced by $q_{-i}(x_{-i})$ over player $i$'s move-specified utilities $u^i(x_i, \cdot)$ (one
such distribution for each $x_i$),

$$ P_{q_{-i}}(u^i(x_i, \cdot) = u) = \int dx'_{-i}\, q_{-i}(x'_{-i})\, \delta(u - u^i(x_i, x'_{-i})). \qquad (43) $$

$\mathscr{I}$ implicitly assumes that those aspects of $i$'s behavior that it is safe for us to
presume are only those that involve the first moments of these distributions,

$$ U^i_{q_{-i}}(x_i) = \int du\, P_{q_{-i}}(u^i(x_i, \cdot) = u)\, u = \int dx'_{-i}\, q_{-i}(x'_{-i})\, u^i(x_i, x'_{-i}). \qquad (44) $$

25 Indeed, in practice each bi is at best loosely known. So formally speaking it is a random 
variable with its own distribution, and so even within a type I game it must be marginalized 
out to get our posterior P(q \ J^). This type of random variable is known as a "hyperpa- 
rameter" |3§ 38 . (A more common example of a hyperparameter is the typically unknown 
width of a Gaussian noise process that corrupts some data.) In particular, in almost all of 
this paper we are implicitly assuming that the posterior over each bi is quite peaked, so that 
in our analysis we can simply set bi to a constant, albeit an unknown one. 
In this it simply emulates conventional game theory.

However in many real-world coupled-players scenarios the higher moments,
reflecting the breadth and overlaps of the distributions over $u^i(x_i, \cdot)$, will have
a major impact on our inference of $q_i$. Intuitively, if those distributions (each
a function purely of $q_{-i}$) maintain the same mean but get broader with more
overlap between them, that will increase the variability of what inferences $i$
makes concerning those means and their linear ordering. (For example, that is
the case if $i$ makes its inference of those means based on empirical samples of
the distributions.) This will make our associated distribution over $q_i$ broader:
there are more $q_i$ that we can conceive of $i$ arriving at. Similarly, such
broadening of the distributions over the $u^i(x_i, \cdot)$ would often be evident to $i$.
That might make $i$ realize it can have less confidence in its inference of the
ordering of the means of those distributions. In such a situation, many real-world
players $i$ would become more conservative in formulating their mixed
strategy, $q_i$. So not only might the distribution over $q_i$ get broader, but it may
also shift, if $q_{-i}$ changes to cause this kind of broadening of the distributions.

To be more quantitative, say the variances of the $u^i(x_i, \cdot)$,

$$ V^i_{q_{-i}}(x_i) \triangleq \left\{ \int dx_{-i}\, q_{-i}(x_{-i})\, [u^i(x_i, x_{-i})]^2 \right\} - [U^i_{q_{-i}}(x_i)]^2, \qquad (45) $$

are increased, and that the overlap between the distributions over each $u^i(x_i, \cdot)$
(measured for example via Kullback-Leibler distance between those distributions)
is also increased. Then there is often increased uncertainty on our part
about the relationships between $i$'s sample-driven preferences among the $x_i$.
This often means we are less sure in our inference of what $i$'s current mixed
strategy is, which means our posterior over $q_i$ should get broader.
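
For a finite game the first and second moments in Eq.'s 44 and 45 reduce to matrix-vector products. The sketch below uses a hypothetical payoff array `u_i` and two hypothetical distributions over the other players' joint move, chosen so that the means coincide while the variances differ.

```python
import numpy as np

def environment_moments(u_i, q_minus_i):
    """Mean U(x_i) and variance V(x_i) of u^i(x_i, .) under q_{-i}.

    u_i[x_i, x_minus_i] is player i's payoff; q_minus_i is a probability
    vector over the (flattened) joint moves of the other players.
    """
    mean = u_i @ q_minus_i                       # discrete form of Eq. 44
    var = (u_i ** 2) @ q_minus_i - mean ** 2     # discrete form of Eq. 45
    return mean, var

# Hypothetical payoffs: 2 moves for player i, 3 joint moves for the others.
u_i = np.array([[1.0, 1.0, 1.0],
                [0.0, 1.0, 2.0]])
q_narrow = np.array([0.0, 1.0, 0.0])   # others' joint move is certain
q_broad = np.array([0.5, 0.0, 0.5])    # same means, more spread

mean_narrow, var_narrow = environment_moments(u_i, q_narrow)
mean_broad, var_broad = environment_moments(u_i, q_broad)
print(mean_narrow, var_narrow)
print(mean_broad, var_broad)
```

Both environments give player $i$ the same expected payoff for each move, so an invariant built only on first moments cannot distinguish them, even though the second move is far noisier under `q_broad`.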

In addition, under such broadening in the $u^i(x_i, \cdot)$ there is increased uncertainty
about what $i$'s best move would be for the actual move $x_{-i}$ that will be
formed by sampling $q_{-i}(x_{-i})$. Typically this means that the information that $i$
has gleaned via its previous interactions with the other players is not as helpful
to $i$ for determining its best move for the current game. Intuitively, when these
distributions are broader $i$ faces worse signal-to-noise in discerning the relation
between the $U^i(x_i)$ based on limited data. This will often manifest itself by
changes to what mixed strategy $i$ is most likely to adopt.

A standard illustration of both of these effects arises if one compares two 
extreme scenarios. The first is the "US economy game" that any particular US 
citizen i repeatedly engages in with the 300 million other human players in the 
US. The second is a simple game against Nature that i repeatedly engages in 
where there is no variance in Nature's choice of move. Our inference of qi is 
far easier in the second scenario. Similarly, typically i will have an easier time 
discerning its best move in the second scenario. 26 

One approach to incorporate such effects would be to have the set of all
$\{V^i(x_i)\}$ (running over $i$ as well as the associated $x_i$) and overlaps between

26 See 11 48 25 15 49 50 , and references therein to "Collective Intelligence" for a discus- 
sion of how this second type of effect can be addressed for mechanism design and distributed 
control. 
the distributions over the $\{u^i(x_i, \cdot)\}$ help specify $b$. Such an approach could
obviously address the second of the effects we're concerned with, involving how
much information $i$ has managed to glean concerning the other players. It is
not a fully satisfactory approach to addressing the first effect however. This
is because once $b$ is set (however that is done) some $q$ are excluded, i.e.,
some $q$ have posterior probability equal to 0. Typically, when we change $b$ to allow
those previously excluded $q$, and thereby broaden the distribution over $q_i$,
the Bayes-optimal (or MAP) $q$ also changes. Moreover, such a modification
invariably excludes some $q$ that were previously allowed (see Sec. 4.7). Instead
what we want is our increase of the breadth of the posterior over $q$ to allow
previously excluded $q$, while still allowing all $q$ we did earlier.

The exclusionary character of the posterior over $q$ that is causing this difficulty
can be removed by casting the analysis in terms of type II games rather
than (as in the exposition above) type I games. After all, in general the $\pi$ that
obeys the conventional type II game invariant has support extending over all $q$
(cf. Eq. 42). A detailed exploration of how to use type II games to incorporate
the effects of the $\{V^i(x_i)\}$ and overlaps between the $\{u^i(x_i, \cdot)\}$ into our posterior
is beyond the scope of this paper however.

The discussion so far has focused on variants of $\mathscr{I}$ that are at most loosely
based on empirical data. Those variants incorporate none of the insight of
behavioral economics, prospect theory, or behavioral game theory |44l 1451 15T1 I43| .
Crucially important future investigations involve incorporating that work, and
more generally the entire field of user-modeling and knowledge-engineering, into
our choice of $\mathscr{I}$.

Finally, it is worth noting that there are alternative $\mathscr{I}$'s that don't involve
the numeric values of the $u^i$'s, but rather only require that each $u^i$ provides
an ordering over the $q$. The idea here is to consider what is invariant if $i$ stays
"just as smart", while $U^i$ undergoes a non-affine monotonically increasing transformation,
and $q_i$ changes accordingly. For example, one might argue that $q_i$
would be "just as smart" after such a transformation if the fraction of alternative
$q'_i$ such that $q'_i \cdot U^i > q_i \cdot U^i$ is the same before and after the transformation.
Formally, this would mean that $\int dq'_i\, \Theta(U^i \cdot [q_i - q'_i])$ is a constant, rather than
(as in the choice considered in this paper) $U^i \cdot q_i$. Intuitively, under this choice,
"how smart" $i$ is reflects how good she is at ruling out some of the candidate
$q'_i$ as inferior to the final $q_i$ she uses. 27 This formalization of how smart $i$ is is
essentially identical to what is called intelligence in work on Collective Intelligence
jnamiEii.
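
The ordinal measure $\int dq'_i\, \Theta(U^i \cdot [q_i - q'_i])$ can be estimated by Monte Carlo. The sketch below is only an illustration: it assumes a finite move space, draws $q'_i$ uniformly from the simplex (a Dirichlet distribution with all parameters equal to 1), and uses a hypothetical utility vector `U`.

```python
import numpy as np

def smartness(U, q, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the fraction of mixed strategies q' with q'.U <= q.U.

    q' is drawn uniformly from the simplex (Dirichlet with all-ones parameters).
    """
    rng = np.random.default_rng(seed)
    samples = rng.dirichlet(np.ones(len(U)), size=n_samples)
    return float(np.mean(samples @ U <= q @ U))

U = np.array([0.0, 1.0, 3.0])            # hypothetical environment U^i
best = np.array([0.0, 0.0, 1.0])         # pure strategy on the best move
worst = np.array([1.0, 0.0, 0.0])        # pure strategy on the worst move
uniform = np.ones(3) / 3

print(smartness(U, worst), smartness(U, uniform), smartness(U, best))

# A monotone, non-affine transformation of U keeps the best pure strategy
# at the top of the ordering, so its score is unchanged.
print(smartness(U ** 3, best))
```

The score is close to 0 for the worst pure strategy and close to 1 for the best one, both before and after the monotone transformation of `U`.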

27 One obvious variation of this measure of how smart $i$ is is to replace the uniform measure
in the integral $\int dq'_i\, \Theta(U^i \cdot [q_i - q'_i])$ with a non-uniform one, for example emphasizing those $q'_i$
having larger dot product with $U^i$. A related variation would replace the Heaviside function
in the integrand with some smooth increasing function, e.g., a logistic function.
5 Independent players



5.1 Basic formulation 

When the players have never previously knowingly interacted, there is no statistical
coupling between the associated mixed strategies. In this case
the setup for coupled players (Sec. 4) does not apply. Instead we must separately
specify likelihoods for each of the players. The joint likelihood is then the
product of those separate likelihoods.

Here for simplicity we consider a game of complete information, so that every 
player knows the move spaces and utility functions for all players. Intuitively, 
those players are not just unaware physical particles, without any "goals" that 
they are "trying" to achieve. Rather each is a reasoning entity, trying to max- 
imize its own utility, and it knows the same holds for the other players. This 
results in the "I know that you know that I know that you prefer ..." common 
knowledge feature that lies at the core of many views of game theory. 28 

Intuitively, this overlap in knowledge among the players acts as a "virtual
coupling" between the players. However it is not a formal statistical coupling.
After all, as mentioned above, $P(\mathscr{I} \mid q) = \prod_i P(\mathscr{I} \mid q_i)$ for our independent
players invariant $\mathscr{I}$. Therefore (for an entropic prior) the posterior distributions
over mixed strategies are statistically independent:

$$ P(q \mid \mathscr{I}) = \prod_i P(q_i \mid \mathscr{I}). \qquad (46) $$

Given this independence, how do we capture the "virtual coupling", so crucial
to noncooperative game theory, in the independent-players invariant $\mathscr{I}$?

To answer this, concentrate on some particular player $i$. As a surrogate for
virtual coupling, say we had a game of actual coupling, as in Sec. 4. That would
set up a distribution over the joint moves of the players other than $i$,

$$ \begin{aligned}
P(x'_{-i} \mid \mathscr{I}_c) &= \int dx'_i\, P(x'_i, x'_{-i} \mid \mathscr{I}_c) \\
&= \int dx'_i\, dq\; q(x'_i, x'_{-i})\, P(q \mid \mathscr{I}_c) \\
&= \int dq \left[ \int dx'_i\, q(x'_i, x'_{-i}) \right] P(q \mid \mathscr{I}_c) \\
&\propto \int dq\; q_{-i}(x'_{-i}) \prod_j e^{\alpha^i S(q_j)}\, \delta\!\left( q_j \cdot U^j_q - \epsilon^i_j(U^j_q) \right) \qquad (47)
\end{aligned} $$

where the subscript $c$ on the invariant indicates it's the invariant for a counterfactual
coupled players scenario, $\alpha^i$ is an associated entropic prior constant for
player $i$, and each $\epsilon^i_j$ is an associated Boltzmann utility function, with (implicit)
Boltzmann constant $b^i_j$.

28 See [54] for a fine-grained distinction between such "common knowledge" and "mutual
knowledge"; such distinctions are not important for current purposes. Also see [55] for related,
qualitative discussion.
Now if player $i$ makes move $x_i$, and the remaining players make move $x'_{-i}$,
then the utility for player $i$ is $u^i(x_i, x'_{-i})$. Accordingly, if the distribution over
$x'_{-i}$ were actually given by Eq. 47 then the expected utility for player $i$ for
making move $x_i$ would be

$$ U^i_c(x_i) \triangleq \int dx'_{-i}\, u^i(x_i, x'_{-i})\, P(x'_{-i} \mid \mathscr{I}_c). \qquad (48) $$

Note that $U^i_c$ implicitly depends on an associated value $\alpha^i$, as well as on the
values $\{b^i_j\}$ parameterizing the set of functions $\{\epsilon^i_j\}$.
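
For a finite game, Eq. 48 is a contraction of player $i$'s payoff array against the counterfactual distribution over the other players' joint move. The sketch below assumes that distribution is already available as an array; here it is simply a hypothetical coupled (non-product) distribution rather than one computed from Eq. 47, and the payoff numbers are likewise made up.

```python
import numpy as np

def counterfactual_environment(u_i, p_others):
    """U_c^i(x_i) = sum over x_{-i} of u^i(x_i, x_{-i}) P(x_{-i} | I_c)."""
    p_others = np.asarray(p_others, dtype=float)
    assert np.isclose(p_others.sum(), 1.0), "P(x_{-i} | I_c) must be normalized"
    # Contract every axis of u_i after the first against the joint distribution.
    return np.tensordot(u_i, p_others, axes=p_others.ndim)

# Hypothetical 3-player game: player i has 2 moves, the other two have 2 moves each.
# u_i[x_i, x_j, x_k] is player i's payoff.
u_i = np.array([[[3.0, 0.0], [0.0, 1.0]],
                [[1.0, 1.0], [1.0, 1.0]]])

# A coupled (non-product) distribution over the other two players' joint move.
p_others = np.array([[0.5, 0.0],
                     [0.0, 0.5]])

U_c = counterfactual_environment(u_i, p_others)
print(U_c)
```

Because `p_others` need not factor into a product of marginals, the same function covers the statistically coupled counterfactual scenario described in the text.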

Say that in choosing its move player $i$ assumes that its actual utility $U^i$ is
well-approximated by $U^i_c$ for some appropriate $\alpha^i$ and $\{b^i_j\}$. This means that the
reasoning of player $i$ reflects the "I know that you know ..." common knowledge
feature of game theory; it makes its move under the presumption that the
counterfactual coupled players scenario gives a good approximation to its actual
environment. With this approach, there is no infinite regress difficulty like that
underlying other approaches to the issue of common knowledge. (This reliance
on counterfactual coupling to formalize that common knowledge feature can be
viewed as an alternative to approaches like Aumann's epistemic knowledge [56].)

Note (as discussed just below Eq. I2.'i[l that $p_{quad} \equiv \int dq\, q\, P(q \mid \mathscr{I}_c)$ is
the Bayes-optimal distribution over joint moves under quadratic loss and the
invariant $\mathscr{I}_c$. So the distribution $P(x_{-i} \mid \mathscr{I}_c)$ underlying $U^i_c$ is the same as the
distribution induced by sampling that single Bayes-optimal distribution. Also
recall that $p_{quad}(x)$ is not a product distribution; under it the moves of the
players are not statistically independent. So we are modeling every player $i$
as though she achieves a certain performance level for a counterfactual game
in which all the players (herself included) make their moves according to the
(coupled) distribution $p_{quad}$, but in reality she is free to make moves according
to a different distribution.

Say that player i makes the perfectly rational move for the counterfactual 
game. In this situation, player i chooses her moves on the presumption that the 
other players all behave according to that counterfactual game. The coupling 
in that counterfactual game can be viewed as how player i implements the
common knowledge reasoning underlying much of conventional game theory. 
Our presumption, formalized below, is that while the behavior of player i will 
not necessarily be perfectly rational for the counterfactual game, that behavior 
can be approximated as though she is trying to behave that way. 

Say that all players $i$ go through the kind of counterfactual reasoning outlined
above for associated values of $\alpha^i$ and $\{b^i_j\}$ that do not vary much between them.
Then they will all have used very similar distributions $P(x \mid \mathscr{I}_c)$ to choose
their moves. This commonality in their reasoning will not statistically couple
their moves; Eq. 46 will still hold. However it will generate the virtual coupling
inherent in the "I know that you know ..." feature. Intuitively, it is because
they all model the "I know that you know ..." phenomenon in terms of similar
statistical coupling scenarios that they are virtually coupled.

Now in practice, no player $i$ will exactly evaluate such a counterfactual coupling
scenario to get a guess for $U^i$ (and indeed may not even be able to, for
example due to computational limitations). But we can presume that each such
player will go through reasoning not too different from such an evaluation, for
some particular $\alpha^i$ and $\{b^i_j\}$. Accordingly, as a surrogate for each player $i$'s
actual reasoning, and the associated virtual coupling among all the players, we
can stipulate that each player's reasoning results in a mixed strategy $q_i$ that
is highly consistent with a counterfactual statistical coupling scenario given by
Eq.EHl

To formalize this we must define what it means to have $q_i$ be "highly consistent"
with that scenario. One natural way to do that is by stipulating that $q_i \cdot U^i_c = K_i$ for
some parameter $K_i$, exactly as in the discussion of effective invariants in Sec.
3.2. In other words, we stipulate that $E_{p_{quad}}(u^i)$ be an $i$-dependent constant.
Plugging it in, this definition of "highly consistent" gives us our invariant for
player $i$, i.e., it gives us the likelihood over $q$ for each player $i$.

In Sec. 5.3 we will replace each $K_i$ with an equivalent parameter $\beta_i$ that
is easier to work with. This parameter will just be the parameter saying how
smart $q_i$ is for utility $U^i_c$, as in Eq. ED and the associated discussion of effective
invariants in Sec. 3.2. To have the notation reflect this alternative parameterization
we will sometimes write $K(U^i_c, \beta_i)$ (again, just as in Sec. 3.2). One of
the major advantages of parameterizing the $i$'th likelihood with $\beta_i$ rather than
$K_i$ is that $\beta_i$ always ranges from 0 to $+\infty$, for any game, for any player $i$, and
independent of what $q_{-i}$ is. This is not the case for $K_i$; its range of values will
depend on $q_{-i}$ in general. Intuitively, $\beta_i$ is simply $K_i$ normalized to account for
this.
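
The correspondence between $K_i$ and $\beta_i$ can be made concrete for a finite move space. The sketch below assumes that $K(U, \beta)$ is the expected value of $U$ under the Boltzmann distribution with exponent $\beta$, as in the discussion of effective invariants, and inverts it by bisection; the environment `U`, the upper limit `beta_hi` and the tolerance are hypothetical choices.

```python
import numpy as np

def boltzmann_utility(U, beta):
    """K(U, beta): expected value of U under q(x) proportional to exp(beta * U(x))."""
    w = np.exp(beta * (U - U.max()))   # subtract the max for numerical stability
    return (w @ U) / w.sum()

def invert_boltzmann_utility(U, k, beta_hi=1e3, tol=1e-10):
    """Find beta >= 0 with K(U, beta) = k by bisection (K is increasing in beta)."""
    lo, hi = 0.0, beta_hi
    if not boltzmann_utility(U, lo) <= k <= boltzmann_utility(U, hi):
        raise ValueError("k is outside the range reachable with 0 <= beta <= beta_hi")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if boltzmann_utility(U, mid) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

U = np.array([5.0, 5.0, 9.0])           # hypothetical environment U_c^i
k = boltzmann_utility(U, 0.7)           # the K_i produced by beta_i = 0.7
beta_rec = invert_boltzmann_utility(U, k)
print(k, beta_rec)
```

The admissible values of $K_i$ run from the uniform average of `U` (at $\beta_i = 0$) up to its maximum, so they depend on the environment, whereas $\beta_i$ always lives on the same half-line.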

As mentioned above, since the players are independent, the joint likelihood
is the product of the separate individual likelihoods. Using our notation for $K_i$,
we can write this likelihood as

$$ P(\mathscr{I} \mid q) = \prod_i P(\mathscr{I} \mid q_i) = \prod_i \delta\!\left( q_i \cdot U^i_c - K(U^i_c, \beta_i) \right), \qquad (49) $$

with each $U^i_c$ given by Eq. 48.

Comparing this with Eq.|551 and recalling that $\epsilon^i(U^i_q) = K(U^i_q, b_i)$, we arrive
at an alternative motivation for the choice of Eq. 49 for the independent-players
likelihood. Our presumption for the independent players scenario is that each
player is coupled to an environment in the exact same way as in the coupled
players scenario, via the function $\epsilon^i$ for some appropriate Boltzmann exponent
(labeled $\beta_i$ for the independent players scenario, and $b_i$ for the coupled players
scenario). However in the coupled players scenario the environment of each
player $i$ is set by the actual $q_{-i}$. In contrast, in the independent players scenario,
each player $i$'s environment is set by a counterfactual $q_{-i}$. Intuitively, we are
presuming that each player $i$ acts just as we do, when we make predictions for
a coupled players scenario.

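Plugging in, the posterior for the independent players scenario is given by

$$ P(q \mid \mathscr{I}) \propto e^{\alpha S(q)}\, P(\mathscr{I} \mid q) = \prod_i e^{\alpha S(q_i)}\, \delta\!\left( q_i \cdot U^i_c - K(U^i_c, \beta_i) \right). \qquad (50) $$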

i 

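Plugging Eq. 48 into this result, we get the posterior probability over $q$ for
independent players:

$$ P(q \mid \mathscr{I}) \propto \prod_i e^{\alpha S(q_i)}\, \delta\!\left[ K(U^i_c, \beta_i) - \frac{\int dx_{-i}\, dq'\; q_i \cdot u^i(\cdot, x_{-i})\, q'_{-i}(x_{-i}) \prod_j e^{\alpha^i S(q'_j)}\, \delta\!\left( q'_j \cdot U^j_{q'} - \epsilon^i_j(U^j_{q'}) \right)}{\int dq' \prod_j e^{\alpha^i S(q'_j)}\, \delta\!\left( q'_j \cdot U^j_{q'} - \epsilon^i_j(U^j_{q'}) \right)} \right]. \qquad (51) $$

Next we plug in the usual coupled players $\epsilon^i_j$:

$$ P(q \mid \mathscr{I}) \propto \prod_i e^{\alpha S(q_i)}\, \delta\!\left[ K(U^i_c, \beta_i) - \frac{\int dx_{-i}\, dq'\; q_i \cdot u^i(\cdot, x_{-i})\, q'_{-i}(x_{-i}) \prod_j e^{\alpha^i S(q'_j)}\, \delta\!\left( q'_j \cdot U^j_{q'} - K(U^j_{q'}, b^i_j) \right)}{\int dq' \prod_j e^{\alpha^i S(q'_j)}\, \delta\!\left( q'_j \cdot U^j_{q'} - K(U^j_{q'}, b^i_j) \right)} \right], \qquad (52) $$

where the $K$ function is as defined in Sec. 3.2, with $U^i_c$ given by Eq. 48 and
parameterized by $\alpha^i$ and the set of values $\{b^i_j\}$. As usual, the posterior over $x$
is given by $\int dq\, q(x)\, P(q \mid \mathscr{I})$ and is identical to the Bayes-optimal $q$ under a
quadratic loss function.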

Intuitively, for fixed $i$, the $\{b^i_j\}$ are how smart player $i$ imputes the other
players in the counterfactual game to be, which she uses to encapsulate the
common knowledge aspect of the game. So it encapsulates how she thinks
the other players will choose their moves. In particular, she presumes that in
formulating their mixed strategies, the other players will consider how smart
she is to be $b^i_i$. Player $i$ then uses $\alpha^i$ to set the relative probabilities of the
$q$'s that are all consistent with those $\{b^i_j\}$. More carefully, $\alpha^i$ and the $\{b^i_j\}$
serve as our presumptions of the values of these quantities inherent to player
$i$. Properly speaking, we do not really presume that she explicitly has such
quantities and uses them to calculate a counterfactual game. Rather we presume
that her behavior can be well-approximated by such a common-knowledge type
of reasoning by her.

In contrast, $\beta_i$ reflects our assessment of how well player $i$ carries out such
reasoning. It measures how smart we believe she is in evaluating the counterfactual
game, and even the degree to which that game really guides her choice of
move. $\alpha$ then controls the relative probabilities of the $q$'s that are all consistent
with our assessment of $\alpha^i$ and the $\{b^i_j\}$ for all players $i$.

Note that since $P(q \mid \mathscr{I})$ is a product distribution, if $P(q_i \mid \mathscr{I})$ changes,
there is no effect on $P(q_{j \ne i} \mid \mathscr{I})$. This is true even if the players are all
fully rational, both in the actual game and the counterfactual game, so that
with probability 1 the system is at a Nash equilibrium of the original game.
Accordingly, issues like whether such a Nash equilibrium is "stable" do not occur
in the independent players scenario. Any change to (the distribution governing)
player $i$'s mixed strategy has no effect on (the distribution governing) the mixed
strategy of any other player $j$. This is because such a player $j$ is playing
best-response to the counterfactual game, not to the actual game.

5.2 Independent players and the impossibility of a Nash 
equilibrium 

Since $P(q \mid \mathscr{I})$ for independent players is a product,

$$ P(x \mid \mathscr{I}) = \int dq\; q(x)\, P(q \mid \mathscr{I}) = \prod_i \int dq_i\; q_i(x_i)\, P(q_i \mid \mathscr{I}). \qquad (53) $$

So our estimate of the joint distribution over moves, $P(x \mid \mathscr{I})$, is a product
distribution. This contrasts with the coupled players scenario (see Sec. 4.3).
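
The factorization in Eq. 53 is easy to verify numerically when each player's posterior is supported on a finite set of candidate mixed strategies. This discretization is an assumption made only for the check; the candidate strategies and posterior weights below are random and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete posteriors: each player has a few candidate mixed
# strategies (rows) with posterior weights; the players are independent.
cands_1 = rng.dirichlet(np.ones(2), size=3)   # 3 candidate q_1 over 2 moves
cands_2 = rng.dirichlet(np.ones(3), size=4)   # 4 candidate q_2 over 3 moves
w_1 = rng.dirichlet(np.ones(3))               # P(q_1 | I)
w_2 = rng.dirichlet(np.ones(4))               # P(q_2 | I)

# Left side of Eq. 53: average q(x) = q_1(x_1) q_2(x_2) over the joint posterior.
joint = sum(w_1[a] * w_2[b] * np.outer(cands_1[a], cands_2[b])
            for a in range(3) for b in range(4))

# Right side: product of each player's posterior-averaged mixed strategy.
product = np.outer(w_1 @ cands_1, w_2 @ cands_2)

print(np.allclose(joint, product))
```

If the weights were instead drawn from a joint posterior that does not factor across players, as in the coupled players scenario, the two arrays would generally differ.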

However just like in the coupled players scenario, in general the distribution
$P(x \mid \mathscr{I})$ need not be a Nash equilibrium, even if the players are all fully rational.
This can be the case even if the players all agree exactly on the counterfactual
game, and if that agreed game is one in which everyone is perfectly rational.
This non-Nash result can even hold if our inference $P(x \mid \mathscr{I})$ is a delta function
(so that $P(q \mid \mathscr{I})$ is as well), one that exactly describes the actual joint move
of the players in the real (not the counterfactual) game. So unlike the coupled
players scenario, there is no issue with how we, the external scientists, go about
our inference; our inference is in fact exactly correct.

Stated differently, assume the following: 

1. The players all share the exact same "common knowledge", namely that
they all perform perfectly rationally; 

2. They all perform perfectly rationally for that common knowledge (so that 
common knowledge is in fact correct); 

3. As a result they definitely make a particular joint move x (i.e., their joint 
mixed strategy is actually a joint pure strategy). 

Then it may still be that that x is not a Nash equilibrium. This is illustrated 
in the first example. 
Example 1: As an example, say that all players agree on the counterfactual
game, and it is a game in which the players all play perfectly rationally, i.e., the
$b_j$ are all infinite. Also have each player be perfectly rational, i.e., have all
$\beta_i$ be infinite. Say that the game has two non-exchangeable pure strategy Nash
equilibria, $x^*(1)$ and $x^*(2)$.

Evaluating, $U^i_c(x_i)$ for this scenario is the expected payoff to player $i$ if she
makes move $x_i$, and if the distribution of other players' moves, $P(x_{-i} \mid \mathscr{I}_c)$, is
given by the uniform average of $\delta_{x_{-i}, x^*_{-i}(1)}$ and $\delta_{x_{-i}, x^*_{-i}(2)}$. 29 Now since player $i$ is
perfectly rational (for the counterfactual game), she will play a mixed strategy
that is payoff-maximizing for this environment, $U^i_c(x_i)$. More precisely, her
distribution $P(q_i \mid \mathscr{I})$ has its support restricted to such mixed strategies $q_i$. 30
However that environment is not the one that arises if the other players are
all playing optimally, i.e., it equals neither the environment of player $i$ for the
Nash equilibrium, $u^i(x_i, x^*_{-i}(1))$, nor the environment for the Nash equilibrium
$u^i(x_i, x^*_{-i}(2))$. Accordingly, in general the optimal $q_i$ for the counterfactual game
(the mixed strategy played by player $i$) is neither of the two associated Nash
equilibrium pure strategies, $\delta_{x_i, x^*_i(1)}$ nor $\delta_{x_i, x^*_i(2)}$.

To illustrate this, say that player $i$ has three possible moves. Have the
payoff to player $i$ for those three moves, $x^*_i(1)$, $x^*_i(2)$ and $x^*_i(3)$, be given by the
vector $(10, 0, 9)$ when the other players collectively make Nash move $x^*_{-i}(1)$.
Have those payoffs be $(0, 10, 9)$ when the other players collectively make Nash
move $x^*_{-i}(2)$. (So $x^*_i(1)$ is indeed best-response for $x^*_{-i}(1)$ and $x^*_i(2)$ is
best-response for $x^*_{-i}(2)$.) However the distribution over the other players' moves
that $i$ considers is

$$ \frac{\delta_{x_{-i}, x^*_{-i}(1)} + \delta_{x_{-i}, x^*_{-i}(2)}}{2}. \tag{54} $$

The best response mixed strategy player $i$ can play for this distribution is
$\delta_{x_i, x^*_i(3)}$, for which the expected payoff is 9. (The expected payoffs for the other
two pure strategies are both 5.) This is neither of the two original game Nash
equilibrium moves for player $i$, which establishes the claim.
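The arithmetic of this illustration can be reproduced in a few lines. The sketch below uses only the payoff vectors and the uniform weighting given in the text; the variable names are illustrative.

```python
# Payoffs to player i for her three moves, against each Nash move of the others.
payoff_vs_eq1 = (10, 0, 9)   # others play their part of x*(1)
payoff_vs_eq2 = (0, 10, 9)   # others play their part of x*(2)

# Player i weights the two Nash environments uniformly (Eq. 54).
weights = (0.5, 0.5)

# Expected payoff of each pure move under that hedged environment.
expected = [weights[0] * a + weights[1] * b
            for a, b in zip(payoff_vs_eq1, payoff_vs_eq2)]

# Index of the payoff-maximizing move (0-based, so 2 is the third move).
best = max(range(len(expected)), key=lambda k: expected[k])

print(expected)  # [5.0, 5.0, 9.0]
print(best)      # 2
```

The third move wins because it is nearly optimal against both equilibria, while each Nash move is optimal against one and worthless against the other.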

So even the outcome of the pure rationality independent players scenario 
need not be a Nash equilibrium of the original game. This is quite reason- 
able. After all, unless there's collusion or some form of (knowing) interaction 
between the players in the past (even if mediated by intermediaries, e.g., via a 
social norm), then there's no way they can coordinate. Intuitively, each player 
i must "hedge her bets". She presumes that the other players will be playing
a Nash equilibrium, but since there is more than one such equilibrium, each 
with a non-zero probability, she must take both into account in choosing her 
move. This means that her move will not be optimal for either one of the Nash 

29 See Eq. 37. Since the two Nash equilibria are both pure strategies, they have the same
entropy (zero). This means they have the same value of their prior probability, and therefore
the same posterior probability in the counterfactual game.

30 As usual, the relative probabilities of those $q_i$ will be given by (the appropriate exponential
of) their entropies (cf. Eq. 50), and the distribution over her moves that we estimate,
$P(x_i \mid \mathscr{I})$, is given by Eq. 53 for this $P(q_i \mid \mathscr{I})$.

equilibria considered by itself. 31 This contrasts with the type of situation that
would prevent us from predicting a Nash equilibrium for the coupled players
scenario. There the difficulty can arise when we, the external scientists making
the prediction, are forced to hedge our bets.

Example 2: Now consider another scenario where again the players and their
counterfactual versions are all fully rational. In this scenario say there is a single
Nash equilibrium in mixed strategies, an equilibrium under which it is not the case
that each player's mixed strategy is uniform over its support. 32

As usual, player $i$ considers the counterfactual game to predict what the
other players are doing. Doing this gives her a set of moves that she could
make, all of which are best responses. Now by symmetry, our estimate of $i$'s
distribution, $P(x_i \mid \mathscr{I})$, is uniform over those best-response moves of hers,
and zero elsewhere. 33 This uniformity will hold for all players. Therefore the
estimate we make of the joint mixed strategy is not the Nash equilibrium of the
game (under which some players have mixed strategies that are non-uniform over
their support). This does not mean that we claim that the Nash equilibrium is
impossible. We assign non-zero $P(q \mid \mathscr{I})$ to that Nash equilibrium $q$ in general.
It is just that our estimate of the joint mixed strategy will not be that Nash
equilibrium.

This contrasts with the coupled players scenario. In that scenario, if you are
explicitly provided an $\mathscr{I}$ saying that all players are perfectly rational, then it is
precisely that $\mathscr{I}$ that tells you that player $i$ must play the Nash equilibrium
non-uniform distribution. If you are not provided that explicit prior information,
then in fact you should not assume that there is perfect rationality.

31 A similar phenomenon occurs in simple single-dimensional decision theory. Under
quadratic loss, if $P(z)$ is the actual distribution of a random variable, the Bayes-optimal
prediction (the prediction that minimizes expected loss under that $P$) is $y = E_P(z)$.
That expectation may even be a point where there is zero probability mass, i.e., it may be
that $P(y) = 0$.

32 Some have worried that this scenario calls into question the validity of the Nash equi- 
librium concept. The issue is why a player i should play a particular non-uniform mixed 
strategy over its best response pure strategies, when the only "advantage" of that mixed 
strategy is that it happens to make the mixed strategies of other players be best-response. 
See for example 157115511551 . 

33 It is interesting to consider this result in light of experimental and theoretical work con- 
cerning risk-dominant Nash equilibria. 
5.3 The MAP q for independent players

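Since for independent players the posterior is a product distribution, the MAP
$q$ is also. So with some abuse of notation, we can write

$$
\begin{aligned}
\mathrm{MAP}(q) &= \operatorname{argmax}_q P(q \mid \mathscr{I}) \\
&= \operatorname{argmax}_q \prod_i e^{\alpha S(q_i)}\, \delta\big(q_i \cdot U^i_c - K(U^i_c, \beta_i)\big) \\
&= \prod_i \operatorname{argmax}_{q_i} P(q_i \mid \mathscr{I}) \\
&= \prod_i \mathrm{MAP}(q_i)
\end{aligned}
\tag{55}
$$

where the index variable $x = (x_1, x_2, \ldots)$ is implicit, as is the conditioning on
the independent-players $\mathscr{I}$. For notational simplicity define

$$ \bar{q}_i \triangleq \mathrm{MAP}(q_i) \tag{56} $$

for each $i$. So we can rewrite Eq. 55 as $\bar{q} = \prod_i \bar{q}_i$.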

In the usual way, by maximizing entropy subject to the associated equality
constraint, each $\bar{q}_i(x_i)$ can be written as $e^{\beta_i U^i_c(x_i)}$, up to an overall proportionality
constant. Recall that in writing $\bar{q}_i$ this way, $\beta_i$ is the Lagrange parameter
enforcing our constraint that $U^i_c \cdot q_i = K(U^i_c, \beta_i)$, i.e., enforcing our restriction that the
$q_i$ be "well-consistent" with $U^i_c$. Writing it out,

$$
U^i_c \cdot \bar{q}_i = K(U^i_c, \beta_i)
= \frac{\int dx_i\, e^{\beta_i U^i_c(x_i)}\, U^i_c(x_i)}{\int dx_i\, e^{\beta_i U^i_c(x_i)}}.
\tag{57}
$$

Given this form for each $\bar{q}_i$, we can write the value of $\bar{q}$ for some arbitrary $x$
as

$$ \bar{q}(x) \propto \prod_i e^{\beta_i U^i_c(x_i)}, \tag{58} $$

where the proportionality constant is independent of $x$. Plugging in, this becomes

(59)

where for simplicity we have absorbed all proportionality constants into
$\beta_i$, writing that new value of $\beta_i$ as $\beta'_i$.

As an example, say that our game has a single Nash equilibrium over pure
strategies, $x^*$. Let the $b_j$ (implicit in the $\epsilon_i$) all go to infinity, keeping the $\alpha^i$ all
finite, in such a way that the posterior distribution over $q$ for the counterfactual
coupled game approaches a single $q$ which is a delta function about $x^*$. So $U^i_c(\cdot)$
approaches $u^i(\cdot, x^*_{-i})$. Then $\bar{q}$ approaches a product of (independent) Boltzmann
distributions:

$$ \bar{q}(x) \propto \prod_i e^{\beta_i u^i(x_i, x^*_{-i})}. \tag{60} $$


This is a product of mixed strategies, each of the form of the Boltzmann dis- 
tribution. As such it is similar to the QRE. Unlike the QRE though, there is 
no coupling between the different mixed strategies comprising q. This reflects 
the fact that, by hypothesis, the players are independent of each other in how 
they form their mixed strategies, as well as in the subsequent moves they make. 
Whenever there is such independence — which is the case in much of conven- 
tional noncooperative game theory implicitly or otherwise — the QRE is not 
an appropriate choice for what kind of product of Boltzmann distributions to 
use to capture bounded rationality. 
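A minimal sketch of such an uncoupled product of Boltzmann distributions follows. The payoff vectors, the player labels, and the exponents are hypothetical; each player's payoffs are taken against a fixed Nash move of the others, so no player's distribution feeds back into any other's.

```python
import math

def boltzmann(payoffs, beta):
    """Distribution over moves proportional to exp(beta * payoff)."""
    scaled = [beta * u for u in payoffs]
    top = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical payoffs u^i(x_i, x*_{-i}) of each player's moves, holding the
# other players at their Nash moves.
payoffs_vs_nash = {"player_0": [3.0, 1.0, 0.0], "player_1": [0.5, 2.0]}
betas = {"player_0": 1.0, "player_1": 4.0}

def product_boltzmann(payoffs_vs_nash, betas):
    """One independent Boltzmann mixed strategy per player."""
    return {i: boltzmann(u, betas[i]) for i, u in payoffs_vs_nash.items()}

def joint_probability(q, x):
    """Probability of joint move x (a dict player -> move index) under the product q."""
    prob = 1.0
    for player, move in x.items():
        prob *= q[player][move]
    return prob

q = product_boltzmann(payoffs_vs_nash, betas)
p_nash = joint_probability(q, {"player_0": 0, "player_1": 1})
```

Each factor is computed from that player's payoffs alone, which is the independence the text contrasts with the QRE, where every player's distribution would enter the others' expected payoffs. Raising a player's exponent concentrates her factor on her best response, and an exponent of zero gives a uniform factor.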

Now say the counterfactual game has two pure strategy Nash equilibria,
$x^*(1)$ and $x^*(2)$, and that in evaluating the counterfactual game agent $i$ gives
them probabilities $c_i$ and $1 - c_i$, respectively. Then rather than Eq. 60 we get

$$
\bar{q}(x) \propto \Big[\prod_i e^{\beta_i c_i u^i(x_i, x^*_{-i}(1))}\Big] \times
\Big[\prod_i e^{\beta_i (1 - c_i) u^i(x_i, x^*_{-i}(2))}\Big],
\tag{61}
$$

i.e., a product of the kind of equilibria arising for the two Nash equilibria taken
separately. If $\beta_i \to \infty$, then agent $i$ chooses the best response to either $x^*_{-i}(1)$
or $x^*_{-i}(2)$, depending on which gives $i$ higher expected payoff (where the
expectation is evaluated according to the distribution $(c_i, 1 - c_i)$).
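The factorization in Eq. 61 can be read off in one step. The display below assumes, as the surrounding text implies but does not write out, that player $i$'s counterfactual environment is the $c_i$-weighted average of her payoffs against the two Nash moves of the other players.

```latex
\begin{aligned}
U^i_c(x_i) &= c_i\, u^i(x_i, x^*_{-i}(1)) + (1 - c_i)\, u^i(x_i, x^*_{-i}(2)), \\
e^{\beta_i U^i_c(x_i)} &= e^{\beta_i c_i\, u^i(x_i, x^*_{-i}(1))} \times
                          e^{\beta_i (1 - c_i)\, u^i(x_i, x^*_{-i}(2))}.
\end{aligned}
```

Taking the product of the second line over all players $i$, as in Eq. 58, and grouping the two kinds of factor gives the two bracketed products.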

6 Miscellaneous topics 

This section presents some illustrative extensions of the basic PGT framework 
presented above. 

6.1 Cost of computation 

For a large range of games, the independent players scenario results in a tradeoff 
between how smart a player is and the cost of the computation they must engage 
in to determine their behavior. This relation between the cost of computation 
and bounded rationality emerges from the mathematics; it is not some ad hoc 
hypothesis we make to explain the observed (bounded rational) behavior of real 
human beings. In addition, using this mathematics, we can quantify the tradeoff 
and when it occurs, and more generally determine what characteristics of the 
game are most intimately related to the tradeoff. (All of that analysis is the 
subject of future work.) 

Say $\beta_i$ increases while all other parameters are fixed, so $U^i_c$ doesn't change.
Then the set of $q_i$ satisfying our invariant $\mathscr{I}$ shifts (cf. Sec. 3.2). Typically such
shifts in that set arising from increases in $\beta_i$ also shrink that set (i.e., its measure
decreases). Intuitively the smarter player $i$ is (for the counterfactual game), the
more assured it is in its assessment of the counterfactual game, and therefore the
more assured it is in making its move. As an example, say that $\alpha^i$ and the
values $\{b_j\}$ restrict $P(q \mid \mathscr{I}_c)$ to one $q$ that is a Nash equilibrium of the game,
an equilibrium which is a joint pure strategy of the players. So $P(x_{-i} \mid \mathscr{I}_c)$
is a delta function about the moves of the players other than $i$ at that Nash
equilibrium. Then for $\beta_i \to \infty$, $q_i$ also becomes restricted to that equilibrium,
i.e., the support of the likelihood, $P(\mathscr{I} \mid q_i)$, gets restricted to a single $q_i$ (one
that is a delta function about that Nash equilibrium's $x_i$). Accordingly the
measure of $q_i$ allowed by the likelihood goes to 0 as $\beta_i$ approaches infinity.

When the set of $q_i$ allowed by the likelihood shrinks this way, the set of $q_i$
allowed by the associated posterior, $P(q_i \mid \mathscr{I})$ (i.e., the set of $q_i$ in the support
of that posterior), must also shrink. Typically this means that the entropy of that
posterior shrinks. Usually this in turn means that the integral of that posterior,
$P(x_i \mid \mathscr{I}) = \int dq_i\, q_i(x_i)\, P(q_i \mid \mathscr{I})$, also gets a smaller entropy as $\beta_i$ increases.
We can illustrate this by returning to our single pure strategy Nash equilibrium
example. In that example, for $\beta_i \to \infty$, the support of $P(x_i \mid \mathscr{I})$ gets restricted
to the Nash equilibrium $x_i$, and therefore its entropy goes to zero, the smallest
possible value. As another example, recall from Sec. 3.2 that since we have fixed
$U^i_c$, the entropy of the MAP $q_i$ cannot increase as $\beta_i$ increases.

In such situations, all these distributions with decreasing entropy have more
and more information as $\beta_i$ increases (recall that the amount of information in a
distribution is the negative of its entropy). Now model agent $i$'s computational
process (in deciding how to move) as starting with the assumption that $\alpha^i, \{b_j\}$
accurately describes the other agents, so that the associated counterfactual game
results in an accurate approximation of $U^i$. Under this model, we can interpret
the amount of information in $P(x_i \mid \mathscr{I})$ as the amount of "computational effort"
$i$ expends to try to approximate $P(x_{-i} \mid \mathscr{I}_c)$ accurately and guess accordingly.

As just argued, typically that amount of information in $P(x_i \mid \mathscr{I})$ (the
negative of its entropy) increases as $\beta_i$ does. So under this model, the larger
$\beta_i$ is, the more computational effort $i$ expends. On the other hand, assume
that the $\alpha^i, \{b_j\}$ going into $i$'s counterfactual game calculation give an accurate
approximation to the actual $U^i$. In this case, the expected payoff to $i$ rises as
$\beta_i$ does. So when the $\alpha^i, \{b_j\}$ give an accurate approximation to $U^i$ (i.e., $i$'s
modeling is accurate), rising $\beta_i$ means both more expected payoff to $i$ and more
computational effort by $i$. Evidently $\beta_i$ controls a tradeoff between how smart
$i$ is and how much computational effort it expends. 34
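The tradeoff can be seen numerically for the simplest object in the argument, the MAP $q_i$, which is a Boltzmann distribution over the fixed environment. The sketch below is an illustration only: the payoff vector and the grid of exponents are hypothetical, and it tracks the MAP mixed strategy rather than the full $P(x_i \mid \mathscr{I})$.

```python
import math

def boltzmann(payoffs, beta):
    """Distribution over moves proportional to exp(beta * payoff)."""
    scaled = [beta * u for u in payoffs]
    top = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy in nats; zero-probability moves contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected(p, payoffs):
    return sum(x * u for x, u in zip(p, payoffs))

payoffs = [10.0, 0.0, 9.0]            # hypothetical fixed environment U_c^i
betas = [0.0, 0.1, 0.5, 1.0, 5.0]     # increasing rationality

rows = []
for beta in betas:
    q = boltzmann(payoffs, beta)
    information = -entropy(q)          # information = negative entropy
    rows.append((beta, information, expected(q, payoffs)))

for beta, information, payoff in rows:
    print(f"beta={beta:4.1f}  information={information:7.4f}  expected payoff={payoff:7.4f}")
```

For non-negative exponents both columns rise together: the derivative of the expected payoff with respect to the exponent is the variance of the payoff, and the derivative of the entropy is minus the exponent times that variance.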

6.2 Rationality functions 

In many situations it would be useful to have a way of quantifying the rationality 
of a player i, based purely on its behavior, without any model of its decision- 
making process (even as ill-specified a model as saying that the player "evaluates 

34 The analogous argument for the coupled players scenario is more problematic. This is 
because as i changes her distribution, for example by increasing her (coupled players value) 
bi, the distribution of the other players must also change, due to the coupling between players. 
This means that the effect on the entropy of i's distribution and to her expected payoff can 
be more complicated. 
a counterfactual game to some given degree of accuracy"). We would like to be
able to do this for any mixed strategy $q_i$ and for any environment $U^i$ (whether
that mixed strategy is the choice of player $i$, as in type II games, or instead
governs how $i$ makes its choice, as in type I games). We would like similar generality
for judging potential moves $x_i$.

In particular, we do not want to require that the mixed strategy of real-world
players has some a priori-specified parameterized form, e.g., a Boltzmann
distribution over its environment. We do not want to assume that our data is
a (perhaps noise-corrupted) stochastic realization of such a mixed strategy, and
accordingly solve for the best-fit values of the associated parameters to some
experimental data (as is done in much of the experimental work involving the
QRE, e.g., [22]). After all, any requirement that the mixed strategy of a real-world
player is exactly given by such a parametric function will almost always be
in error, at least to a degree. This section presents such a broader quantification
of rationality.

Consider the situation where player $i$ has mixed strategy $q_i$ and her
environment is some fixed $U^i$. It is reasonable to say that two choices of $q_i$ are
equally rational if they have the same dot product with $U^i$. However we will
often want to do more than simply say whether two $q_i$ are equally rational for
some particular $U^i$; we will often want to say whether a $q_i$ operating in
environment $U^i$ is more or less rational than a $q'_i$ operating in environment $(U')^i$.
To do this we need a scalar-valued function $R(V, p)$ that measures how rational
an arbitrary distribution $p(y)$ is for an arbitrary utility function $V(y)$, i.e., that
measures how peaked $p(y)$ is about the maximizers of $V(y)$, $\operatorname{argmax}_y V(y)$, and
about the other $y$ that have large $V(y)$ values.

Say that $p$ is a Boltzmann distribution over $V(y)$, $p(y) \propto e^{\beta V(y)}$. Then we
can use information theory in general, and effective invariants and the functions
$\epsilon_i$ discussed above in particular, to motivate quantifying the rationality of $p$ for
$V$ as the value $\beta$. The larger $\beta$ is, the more peaked $p$ is about the better mixed
strategies, and therefore the more "rational" $p$ is.

In addition, so long as $p''$ and $p'$ are Boltzmann distributions for $V''$ and
$V'$ respectively, this measure of the associated $\beta$ value can be used to compare
the rationality of $p''$ for $V''$ with the rationality of $p'$ for $V'$. We can do this
even if the range of the function $V'$ differs from that of $V''$. This attribute
of our measure differs from other naive choices for measuring rationality. In
particular, it differs from the choice of measuring rationality as $p \cdot V$, which
not only reflects how peaked $p$ is about $y$ that give large $V(y)$, but also reflects
the range of values of $V(\cdot)$. (Indeed, simply translating the values of $V(\cdot)$ by a
constant will modify the value of this alternative choice of rationality function.)

In general though $p$ will not be a Boltzmann distribution. So we need to
extend our reasoning, to define an $R$ that we can reasonably view as a quantifier
of rationality for any $p$. Formally, we make two requirements of $R$:

1. If $p(y) \propto e^{\beta V(y)}$ for non-negative $\beta$, then the peakedness of the
distribution, i.e., the value of $R(V, p)$, is $\beta$.

2. Out of all $p$ satisfying $R(V, p) = \beta$, the one that has maximal entropy
is proportional to $e^{\beta V(y)}$. In other words, we require that the Boltzmann
distribution maximizes entropy subject to a provided value of the
rationality/temperature.

We call any such $R$ a rationality function.

Note that a rationality function can be applied to physical systems, where 
V(y) is interpreted as the Hamiltonian over microstates y. Such a function is 
defined even for systems that are not at physical equilibrium (and therefore 
aren't described by Boltzmann distributions). In this, rationality functions are 
an extension of the conventional definition of temperature in statistical physics. 

As an illustration, a natural choice is to define $R(V, p)$ to be the $\beta$ of the
Boltzmann distribution that "best fits" $p$. To formalize this we must quantify
how well any given Boltzmann distribution "fits" any given $p$. Information theory
provides many measures for how well a distribution $p_1$ is fit by a distribution
$p_2$. One such measure is the Kullback-Leibler distance [8, 60, 46]:

$$ KL(p_1 \,\|\, p_2) \triangleq S(p_1 \,\|\, p_2) - S(p_1), \tag{62} $$

where $S(p_1 \,\|\, p_2) \triangleq -\int dy\, p_1(y) \ln\!\left[\frac{p_2(y)}{\mu(y)}\right]$ is known as the cross entropy from
$p_1$ to $p_2$ (and as usual we implicitly choose uniform $\mu$).

The KL distance is always non-negative, and equals zero iff its two arguments
are identical. In addition, $KL(a p^1 + (1 - a) p^2 \,\|\, p^2)$ is an increasing function of
$a \in [0.0, 1.0]$, i.e., as one moves along the line from $p^1$ to $p^2$, the KL distance from
$p^1$ to $p^2$ shrinks. 35 The same is true for $KL(p^2 \,\|\, a p^1 + (1 - a) p^2)$. In addition,
those two KL distances are identical to 2nd order about $a = 0$. However they
differ as one moves away from $a = 0$ in general; KL distance is not a symmetric
function of its arguments. In addition, it does not obey the triangle inequality,
although it obeys a variant [8]. Despite these shortcomings, it is by far the most
common way to measure the distance between two distributions.
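These properties are easy to confirm on a small discrete example. The sketch below assumes a uniform measure, so the KL distance reduces to the familiar sum of $p_1 \ln(p_1 / p_2)$; the two distributions are arbitrary full-support examples chosen for illustration.

```python
import math

def kl(p1, p2):
    """KL distance from p1 to p2 for discrete distributions (p2 must be > 0 wherever p1 is)."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)

def mix(a, p1, p2):
    """The point a * p1 + (1 - a) * p2 on the line between the two distributions."""
    return [a * x + (1 - a) * y for x, y in zip(p1, p2)]

p1 = [0.7, 0.2, 0.1]
p2 = [0.2, 0.3, 0.5]
steps = [k / 10 for k in range(11)]    # a = 0.0, 0.1, ..., 1.0

forward = [kl(mix(a, p1, p2), p2) for a in steps]    # KL(a p1 + (1-a) p2 || p2)
backward = [kl(p2, mix(a, p1, p2)) for a in steps]   # KL(p2 || a p1 + (1-a) p2)

# Both distances vanish at a = 0 and grow as a moves toward 1.
assert forward[0] == 0.0 and backward[0] == 0.0
assert all(x < y for x, y in zip(forward, forward[1:]))
assert all(x < y for x, y in zip(backward, backward[1:]))

# The KL distance is not symmetric in its arguments.
assert abs(kl(p1, p2) - kl(p2, p1)) > 1e-3
```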

Recall the definition of the partition function, $Z(V) \triangleq \int dy\, e^{V(y)}$ (the
normalization constant for the distribution proportional to $e^{V(y)}$). Using the KL
distance and this definition, we arrive at the rationality function

$$
\begin{aligned}
R_{KL}(V, p) &\triangleq \operatorname{argmin}_\beta KL\Big(p \,\Big\|\, \frac{e^{\beta V}}{Z(\beta V)}\Big) \\
&= \operatorname{argmin}_\beta \Big[-\beta \int dy\, p(y) V(y) + \ln(Z(\beta V)) - S(p)\Big] \\
&= \operatorname{argmax}_\beta \Big[\beta \int dy\, p(y) V(y) - \ln(Z(\beta V))\Big].
\end{aligned}
\tag{63}
$$

In [25] it is proven that $R_{KL}$ respects the two requirements of rationality
functions. Note that the argument of the argmin is globally convex (as a function
of the minimizing variable $\beta$). In addition its second derivative is given by the
variance of $V(y)$ under the Boltzmann distribution $e^{\beta V(y)}/Z(\beta V)$. This typically
makes numerical evaluation of $R_{KL}$ quite fast.
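One simple way to evaluate $R_{KL}$ numerically for a discrete $y$ is sketched below. It is an assumption of this sketch, not a method given in the text, that the maximizer is found by bisection on the first-order condition of Eq. 63, namely that the Boltzmann expectation of $V$ equals the expectation of $V$ under $p$. The function names and the search bracket are hypothetical.

```python
import math

def boltzmann(V, beta):
    """Distribution over y proportional to exp(beta * V(y))."""
    scaled = [beta * v for v in V]
    top = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def expected(p, V):
    return sum(pi * vi for pi, vi in zip(p, V))

def kl_rationality(V, p, lo=-50.0, hi=50.0, iters=200):
    """Beta at which the Boltzmann expectation of V matches the expectation under p.

    The derivative of the argmax objective in Eq. 63 is E_p[V] - E_beta[V], and
    E_beta[V] is increasing in beta, so bisection finds the root. The bracket
    [lo, hi] is an arbitrary choice; answers outside it are clipped to its ends.
    """
    target = expected(p, V)
    if target >= max(V):
        return math.inf       # p sits entirely on the maximizers of V
    if target <= min(V):
        return -math.inf      # p sits entirely on the minimizers of V
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected(boltzmann(V, mid), V) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

V = [1.0, 2.0, 4.0]
p_boltz = boltzmann(V, 1.7)                       # a Boltzmann distribution
m = expected(p_boltz, V)
p_other = [1.0 - (m - 1.0) / 3.0, 0.0, (m - 1.0) / 3.0]   # non-Boltzmann, same E[V]
V_shifted = [v + 100.0 for v in V]

r_boltz = kl_rationality(V, p_boltz)
r_other = kl_rationality(V, p_other)
r_shifted = kl_rationality(V_shifted, p_boltz)
```

In this example `r_boltz` recovers the exponent 1.7 used to build the distribution, as the first requirement demands. `r_other` takes the same value because the objective in Eq. 63 depends on $p$ only through its expected $V$. `r_shifted` is also unchanged, whereas the naive measure $p \cdot V$ moves by the full translation of 100.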

35 This follows from the fact that the second derivative with respect to $a$ is non-negative for
all $a$, combined with the fact that KL distance is never negative and equals 0 when $a = 0$.
Comparing the definition of $R_{KL}$ to Eq. ^3 we see that the KL rationality
of a distribution $p$ is just the value of $\beta$ for which $p$ has minimal free utility
gap. When $p$ is a Boltzmann distribution over the states of a statistical physics
system, this $\beta$ is (the reciprocal of) what is called temperature in the in App.|5J
Systems described by such distributions are at physical equilibrium. In other
words, the physical temperature of a physical system at physical equilibrium
is (the reciprocal of) its KL rationality. KL rationality is also defined for
off-equilibrium systems however, unlike physical temperature.

To help understand the intuitive meaning of the KL rationality function,
consider fixing its value for agent $i$ to some value $\rho_i$. Say $q_{-i}$ is also fixed (and
therefore so is player $i$'s environment, $U^i_{q_{-i}}$). Then there is a value $\epsilon_i$ such that
the set of all $q_i$ having rationality value $\rho_i$ is identical to the set of all $q_i$ for
which $E_{q_i}(U^i_{q_{-i}}) = \epsilon_i$. In fact, $\epsilon_i$ is the expected value of $U^i$ that would arise
if $q_i(x_i)$ were a Boltzmann distribution (over $U^i_{q_{-i}}(x_i)$ values) with Boltzmann
exponent $\beta_i = \rho_i$. 36

So knowing that player $i$ has KL rationality $\rho_i$ is equivalent to knowing that
the actual expected value of $U^i$ under $q_i$ equals the "ideal expected value", in
which $q_i$ is replaced by the Boltzmann distribution over $U^i_{q_{-i}}(x_i)$ values with
exponent $\beta_i = \rho_i$. (However note that such a constraint on the value of $\rho_i$ does
not actually specify $q_{-i}$, so it does not specify that ideal expected value of $U^i$.)
The (loose) physical analog of this result is that all distributions over states of
a physical system having the same (potentially non-equilibrium) temperature
also have the same expected value of the Hamiltonian.

Comparing with the discussion in Sec. 0] we see that specifying the KL
rationalities of all the players is exactly the same as specifying that they all
obey the coupled players invariant, with the parameters of the functions $\epsilon_i$
given by those specified rationality values. An $\mathscr{I}$ specifying the one scenario is
identical to an $\mathscr{I}$ specifying the other one. Accordingly, all the discussion in
Secs. 4.1 to 4.8 holds for making predictions based on specified rationalities of the
players. In particular, as discussed in Sec. 4.1, the rationalities of the players in
a game reflect the structure of that game, as much as they reflect the intrinsic
characteristics of the players.

All of the foregoing was for quantifying the rationality of a particular $q_i$.
However we can view the rationality of a particular move $x_i$ as a special case, where
the "mixed strategy" $q_i$ is a delta function about one of its moves. (Behaviorally,
it makes no difference if that $x_i$ is a sample of some preceding $q_i$ that $i$ chose, or
36 To see all this, note that by definition of the KL rationality function,

$$
\left. \frac{\partial \ln(Z(\beta U^i_{q_{-i}}))}{\partial \beta} \right|_{\beta = R_{KL}(U^i_{q_{-i}},\, q_i)}
= \int dx_i\, q_i(x_i)\, U^i_{q_{-i}}(x_i).
$$

However by the discussion in Sec. 3.2 we know that the quantity on the left-hand side is
just the Boltzmann utility evaluated at the specified value of $\beta$, $K_i(\beta)$. So
$R_{KL}(U^i_{q_{-i}}, q_i) = R_{KL}(U^i_{q_{-i}}, q'_i) \Rightarrow K(R_{KL}(U^i_{q_{-i}}, q_i)) = K(R_{KL}(U^i_{q_{-i}}, q'_i)) \Leftrightarrow E_{q_i}(U^i_{q_{-i}}) = E_{q'_i}(U^i_{q_{-i}})$. So
any two $q_i$'s with the same rationality must have the same expected $U^i_{q_{-i}}$. To prove the other
direction, recall that for fixed $U^i_{q_{-i}}$, the Boltzmann utility is a bijection from values of $\beta$ into
$\mathbb{R}$. QED.

instead is $i$'s choice directly.) Plugging that in to the KL rationality function,
we get the following definition of the rationality of a move $x_i$.

Say that a player $i$ makes move $x_i$ when there is an environment $U^i$. Then
the KL rationality of that move is the $\beta$ such that if $i$ had instead chosen a
Boltzmann mixed strategy with exponent $\beta$, the resultant expected value of
$U^i$ would have been the same as $i$'s actual expected utility. Formally, the KL
rationality function is the mapping from $(x_i, U^i)$ to the $\beta$ such that

$$
\frac{\int dx'_i\, U^i(x'_i)\, e^{\beta U^i(x'_i)}}{\int dx'_i\, e^{\beta U^i(x'_i)}} = U^i(x_i).
\tag{64}
$$

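The rationality of a single move can be computed by solving that matching condition for $\beta$. The sketch below does so by bisection for a discrete move space; the environment values, the function names, and the search bracket are hypothetical choices for illustration.

```python
import math

def boltzmann_utility(U, beta):
    """Expected value of U under the Boltzmann distribution with exponent beta."""
    scaled = [beta * u for u in U]
    top = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return sum(w * u for w, u in zip(weights, U)) / sum(weights)

def move_rationality(U, move, lo=-50.0, hi=50.0, iters=200):
    """Beta whose Boltzmann expected utility equals the utility of the given move.

    A best move has unbounded rationality and a worst move has rationality
    minus infinity; otherwise bisection applies because the Boltzmann utility
    is increasing in beta. The bracket [lo, hi] is an arbitrary choice.
    """
    target = U[move]
    if target >= max(U):
        return math.inf
    if target <= min(U):
        return -math.inf
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if boltzmann_utility(U, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

U = [0.0, 5.0, 10.0]                   # hypothetical environment of player i
rationalities = [move_rationality(U, k) for k in range(len(U))]
```

With this environment the middle move earns exactly the uniform-average payoff, so its rationality is zero, the rationality of a player who chooses at random.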
6.3 Variable numbers of players 

There are many statistical ensembles considered in statistical physics in addition 
to the CE. In particular, in the Grand Canonical Ensemble (GCE), the numbers 
of the particles of various types in the system is itself a stochastic quantity, in 
addition to the states of those particles. This is how one analyzes the statistics 
of physical systems involving chemical and/or particle physics interactions that 
change the particles of the system. 

Recall that the CE can most cleanly be derived as an MAP distribution with
an entropic prior and an appropriate expectation value constraint (App.EJ). The
GCE can be derived the same way. Whereas with the CE the expectation value
constraint only concerns the expected energy, in the GCE it also concerns the
expected numbers of particles of the various possible types [T§].

As pointed out in [111 1251 1ST], the same approach used in the GCE
can also be applied in a game theory context. In such a context, rather than
"particles of various types", one has "players of various types". Broadly speaking,
after this substitution, the ensuing analysis for the game theory context
proceeds analogously to that of the statistical physics context.

To illustrate this we present a game theory scenario that roughly parallels the 
GCE. 37 We postulate some pre-fixed set of player types. All players of a given 
type have the same move space and the same payoff function. At the beginning 
of each instance of our scenario, a set of players is randomly chosen, and each 
is assigned a rationality value randomly. Those players are then coupled as 
discussed above in Sec. 01 e.g., via a sequence of noncooperative games, and the 
instance ends with all of the players making a move. 

We know that the expected number of players of any one of the player
types is the same from one instance to the next, although we do not necessarily
know that expectation value. We similarly assume the expected rationality for
each player type (i.e., the expected value of $b_i$, in the terminology of Sec. 4.2)
is the same from one instance to the next, without necessarily knowing those
37 One difference is that the GCE allows arbitrary statistical coupling between all variables.
In contrast, here we impose numerous statistical independences among the variables, e.g.,
statistical independence between the moves of the players. Another difference is that there are
multiple utility functions in games, whereas there is only one analogous quantity (the Hamiltonian)
in physical systems. This makes the formulas here more complicated than those in the GCE.
rationality values. These rationality values are statistically independent from
each other.

We formalize this with an encoding of our variables into $x$ modeled on the
scheme used to derive the GCE [15]. For all player types $i$, $x^N_i$ indicates the
number of players of that type. For all integers $j > 0$, and all player types
$i$, $x^M_{i,j}$ indicates the move of the $j$'th player of type $i$, assuming there is such
a player (i.e., assuming that $j \le x^N_i$). The meaning of $x^M_{i,j}$ for larger $j$ is
undefined/irrelevant. Similarly $x^R_{i,j}$ indicates the rationality of the $j$'th player
of type $i$, assuming there is such a player, and is undefined otherwise.

We write $x^N$, $x^M$, and $x^R$, respectively, to indicate the vector of all player-type
cardinalities, the (countably infinite dimensional) vector of the moves by all
possible players (including those that do not actually exist), and the (countably
infinite dimensional) vector of the rationalities of all possible players (including
those that do not actually exist). We also write the utility function of the type
$i$ players as $g_i(x^M, x^N)$, where $\forall i$, $g_i(x^M, x^N)$ is independent of $x^M_{i,j}$ $\forall j > x^N_i$.
Finally, we write $N_i$ and $R_i$ to indicate the (fixed but potentially unknown)
expected number of players of type $i$ and expected rationality of those players,
respectively.

As in Sec. E21 the moves of our players are independent once the characteristics
of the game are fixed (i.e., we are dealing with a conventional noncooperative
game in which the moves are given by sampling an associated joint
mixed strategy). However here the moves can be statistically dependent on
those characteristics. For example, if the rationality $x^R_{i,j} = 0$ for some $j \le x^N_i$,
then we know that $q^M_{i,j}$ must be uniform, independent of the mixed strategies of
the other players.

Reflecting this, we write 

hj V ,3' 

where the products over $j$ and $j'$ both run from 1 to $\infty$. When the argument
makes clear what the superscript $\{M, N, R\}$ should be, we will sometimes leave
that superscript implicit. Note that in reflection of the statistical coupling of the
components of $x$, $q$ is not a product distribution. So in particular the entropy
of $q$ is not a sum of the entropies of its marginalizations, as it was above.
Writing it out,

$$
q(x^M \mid x^N, x^R) =
\frac{\int dx'\, q(x')\, \delta(x'^M - x^M)\, \delta(x'^R - x^R)\, \delta(x'^N - x^N)}
     {\int dx'\, q(x')\, \delta(x'^R - x^R)\, \delta(x'^N - x^N)}.
\tag{66}
$$

With some abuse of notation, we will write "$q^M_{i,j}(\cdot \mid x^N, x^R)$" to mean the
(infinite-dimensional) vector with component $x^M_{i,j}$ given by $q(x^M_{i,j} \mid x^N, x^R)$.

Our invariant says that each $q^N_i$ must result in an average of $x^N_i$ that equals
$N_i$, and similarly for each $q^R_i$ and $R_i$. It also says that once $x^N$ and $x^R$ are
fixed, $q^M$ must be the joint mixed strategy appropriate for an associated coupled
players type II game. To write out this latter condition, first define "$-(i,j)$"
to mean all players other than $(i,j)$ (including players of type $i$ other than the
$j$'th one of that type). Next as shorthand we will often take the distribution
over all agents other than $(i,j)$ implicit and write

$$
\begin{aligned}
U^{i,j}(x^M_{i,j}, x^R, x^N) &\triangleq U^{i,j}_{q^M_{-(i,j)}(\cdot \mid x^R, x^N)}(x^M_{i,j}, x^N) \\
&\triangleq \int dx^M_{-(i,j)}\, q(x^M_{-(i,j)} \mid x^R, x^N)\, g_i(x^M_{i,j}, x^M_{-(i,j)}, x^N)
\end{aligned}
\tag{67}
$$

where we will write $U^{i,j}(\cdot, x^R, x^N)$ to mean the (infinite-dimensional) vector
with component $x^M_{i,j}$ given by $U^{i,j}(x^M_{i,j}, x^R, x^N)$. So the coupled players portion
of our invariant says that 38

$$
q^M_{i,j}(\cdot \mid x^R, x^N) \cdot U^{i,j}(\cdot, x^N, x^R)
= K\big(U^{i,j}(\cdot, x^N, x^R),\, x^R_{i,j}\big) \quad \forall i, j.
\tag{68}
$$

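Combining these three separate aspects of the invariant and explicitly expanding
in full each instance that a component of $q$ occurs, we get

$$
\begin{aligned}
P(\mathscr{I} \mid q) = \prod_i \Big[\, & \delta\Big(N_i - \int dx^N_i\, q^N(x^N_i)\, x^N_i\Big)\,
\delta\Big(R_i - \int dx^R_i\, q^R(x^R_i)\, x^R_i\Big) \times \\
& \prod_j \int dx^N dx^R\, q^N(x^N)\, q^R(x^R) \times \\
& \delta\Big( q^M_{i,j}(\cdot \mid x^R, x^N) \cdot U^{i,j}_{q^M_{-(i,j)}(\cdot \mid x^R, x^N)}(\cdot, x^N)
- K\big(U^{i,j}_{q^M_{-(i,j)}(\cdot \mid x^R, x^N)}(\cdot, x^N),\, x^R_{i,j}\big) \Big) \Big].
\end{aligned}
\tag{69}
$$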

We then combine this likelihood with an entropic prior over q. This gives 
us the posterior P(q \ As usual, if we wish to we can consider the MAP q 
according to this posterior, various Bayes-optimal q's according to this posterior, 
etc., thereby getting a single distribution over x's. 

Again just like in the usual analysis, as an alternative to these distributions, 
over x's we can simply write P(x \ J?) directly, getting the same answer as the 
Bayes-optimal q under quadratic loss: 

P(x | J) = J dq P{q | J)q{x). (70) 

To evaluate this integral we must plug in for q(x), use Eq. 69 for the
likelihood, and then use the usual entropic prior. Also as usual we must be
careful to calculate the normalization constant for the posterior $P(q \mid \mathscr{I})$ and
divide the product of the likelihood and prior by it.
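
The following is a minimal numerical sketch of Eq. 70, not the paper's calculation. It assumes a finite set of randomly drawn candidate q's standing in for the full space of distributions, a hypothetical scalar observable, and a narrow Gaussian standing in for the delta-function likelihood; the names `candidates`, `alpha`, `target`, and `width` are illustrative only. It shows that the posterior-weighted mixture of the q's is also the guess with the smallest posterior expected quadratic loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite stand-in for the space of q's: K candidate
# distributions over n joint moves x, drawn at random for illustration.
K, n = 200, 4
candidates = rng.dirichlet(np.ones(n), size=K)   # shape (K, n)


def shannon_entropy(q):
    """Shannon entropy S(q) in nats, with 0 log 0 taken as 0."""
    q = np.asarray(q, dtype=float)
    safe = np.where(q > 0, q, 1.0)
    return -np.sum(q * np.log(safe), axis=-1)


# Assumed toy posterior: entropic prior exp(alpha * S(q)) times a soft
# Gaussian stand-in for a delta-function constraint on an expected value.
alpha, target, width = 2.0, 1.2, 0.1
values = np.arange(n, dtype=float)               # hypothetical observable
log_post = (alpha * shannon_entropy(candidates)
            - 0.5 * ((candidates @ values - target) / width) ** 2)
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()                         # posterior normalization

# Discrete analogue of Eq. 70: P(x | I) = sum_k P(q_k | I) q_k(x).
p_x = weights @ candidates


def expected_quadratic_loss(guess):
    """Posterior expectation of the squared distance from q to `guess`."""
    return float(np.sum(weights * np.sum((candidates - guess) ** 2, axis=1)))


map_q = candidates[np.argmax(weights)]
print("P(x | I):", p_x)
print("loss at posterior mean:", expected_quadratic_loss(p_x))
print("loss at MAP candidate: ", expected_quadratic_loss(map_q))
```

In this toy setting the MAP candidate and the posterior mean generally differ, which illustrates why the choice of loss function matters for the single prediction one reports.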

38 Unfortunately, even with this abusive notation, book-keeping in the equations can get 
messy. 
However arrived at, once we get a distribution over x, we can then marginalize
over various components of x to get distributions over the associated quantities
of interest. For example, we can do this to determine the typical move
of a player of a particular type, the typical number of players of some type
conditional on a particular move made by the first player of that type, etc.
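
As an illustration of this marginalization step, the sketch below uses a made-up joint distribution over two components of x: the move of the first player of some type and the number of players of that type. The table entries and the names `p_joint` and `counts` are assumptions for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical joint distribution P(x) over two components of x:
# axis 0 = move of the first player of some type (3 possible moves),
# axis 1 = number of players of that type (1, 2, or 3).
# The numbers are invented for illustration only.
p_joint = np.array([
    [0.10, 0.15, 0.05],
    [0.05, 0.20, 0.15],
    [0.05, 0.10, 0.15],
])
counts = np.array([1, 2, 3])

p_move = p_joint.sum(axis=1)     # marginal over the player count
p_count = p_joint.sum(axis=0)    # marginal over the first player's move

# Distribution of the player count conditional on each move.
p_count_given_move = p_joint / p_move[:, None]

typical_move = int(np.argmax(p_move))
expected_count_given_move = p_count_given_move @ counts

print("marginal over moves:", p_move)
print("most probable move:", typical_move)
print("expected player count given each move:", expected_count_given_move)
```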

7 Discussion and Future Work 

It is worth comparing PGT to approaches based on models of actual human
beings, like those using models of agent learning or models incorporating the 
mathematical structure of statistical physics [SHEII- Broadly speaking, PGT's 
motivation is more like that of conventional game theory than that of model- 
based approaches. Like conventional game theory, PGT investigates what can 
be gleaned by careful consideration of the abstract problem of interacting goal- 
directed agents, before the introduction of experiment-based insight concerning 
the behavior of those agents. 

An even closer analogy to PGT's motivation than that provided by conven- 
tional game theory is Bayesian statistics, and especially Bayesian statistics using 
invariance-based arguments to set the prior yQ. Like such Bayesian statistics, 
PGT is a first-principles-driven derivation of a framework for analyzing systems, 
a framework into which one can "slot in" any kind of experimental data as it
becomes available.

While the extraordinary success of statistical physics has been used to choose 
the entropic prior for this paper, it is important to emphasize that many other 
priors can also be motivated using first-principles arguments, many of them 
also based on information-theoretic arguments. Similarly, many other choices 
of likelihood (the invariant) can be motivated (as discussed above). PGT is not 
restricted to the prior and likelihood considered in this paper, any more than 
conventional game theory is restricted to some particular refinement of the Nash 
equilibrium concept. The defining characteristic of PGT is the application of
such priors and likelihoods to game outcomes rather than (or in addition to) 
within games. The prior and likelihoods considered here are simply the examples 
worked out in this initial paper. 

Obviously, if you happen to know what algorithm the players are using, then 
that should be reflected in the likelihood. PGT for various simple choices of such 
algorithms/likelihoods is the subject of future work. More generally, humans 
have lots of cognitive quirks presumably arising due to evolution. Accordingly 
the precise priors and likelihood investigated here may work best for computa- 
tional agents involved in a game with no foreknowledge of the game. Important 
future work involves analysis with other priors and likelihoods incorporating behavioral
economics results, prospect theory, etc. These alternatives can be used
for the external scientist's assessment of the individual players and/or (in the 
independent players scenario) for the "models" the players have of each other. 

Indeed, PGT can be seamlessly extended to encompass other kinds of $\mathscr{I}$,
even kinds that do not involve utility functions. In particular, one or more
observed samples of a mixed strategy $q_i$ can naturally be incorporated into
the likelihood term, $P(q_i \mid \mathscr{I})$. As another example, we can remove from $\mathscr{I}$
the stipulation that our players' choices of pure strategy are independent of 
one another, i.e., the stipulation that we use a product distribution. Doing so 
naturally results in correlated moves among the players, without any need for 
carefully designed ansatz's like those behind correlated equilibria |2(J|. 

Similarly, there is a good deal of empirical evidence that human players do
not prefer to maximize expected utility functions $\int dx_i\, q_i(x_i)\, U^i(x_i)$. Rather a
long line of experiments starting with Allais' paradox [51] indicate that what
is invariant in the decision-making of a human i is some non-linear functional
of its mixed strategy $q_i$. As more gets understood about such psychological
phenomena [65] it should be straightforward to incorporate that understanding
into (Bayesian) PGT. One simply changes what is considered invariant from one
instance of the inference problem to the next, from being a linear functional of
$q_i$ to being some other type of functional.

Related future work will integrate behavior modeling ("user modeling", belief
nets, etc.) with PGT, to get an empirical science of human interactions.
Such behavior modeling can run the gamut from knowledge concerning humans 
in general (e.g., behavioral economics) to knowledge concerning certain particular
humans (psychological profiling, and in particular "games against nature",
i.e., the decision-making belief net of a particular human, in a non-game theory 
context inSI)- 

In addition to the foregoing, there is a huge amount of future work in PGT
that carries over from conventional game theory. At the risk of being glib, almost 
every aspect of conventional game theory can be re-analyzed using PGT. This 
includes in particular cooperative game theory, in which context PGT should 
cut the Gordian knot of what equilibrium concept to adopt. Other broad topics 
that should be investigated using PGT — and therefore bounded rationality — 
are mechanism design, folk theorems, and signaling theory. It may also prove 
profitable to have such investigations be extended to allow a varying number of
players. Similarly, "bounded rational" evolutionary game theory, in particular 
for finite numbers of agents, can be investigated using the "GCE" (variable 
number of players) variant of PGT illustrated above. All of this is in addition 
to more circumscribed game theory issues, like different types of noncooperative 
games (Bayesian games, correlated equilibrium games, differential games, etc.). 

Other future work involves completing the analysis of the relationship between
QRE and the coupled players MAP (and Bayes-optimal) q's. This can
also be extended to the independent players. Similarly, coverage issues like those
presented in Props. 1 and 2 for the coupled players scenario bear investigating
for the independent players scenario.

Other future work will investigate what happens in the variable number of 
players scenario if the random variable of the number of players of type i is not
independent of the random variable of the total utility accrued by all players 
of that type. One aspect of such an investigation would see what happens if
that random variable is statistically coupled to $U^i / x^N_i$, the average, over players
of type i, of the expected utility of those players. In particular, it is interesting
to see what happens if that variable is coupled to $U^i / (x^N_i \sum_j U^j)$, the ratio of total
expected utility that is earned by players of type i, divided out among those
players.

All of this is in addition to the future work mentioned in the preceding 
sections. 

8 Appendix 1 — Historical context of PGT 

Despite its widespread and profound usefulness in other fields, attempts to use
Shannon entropy in game theory, psychology, and economics have proven controversial
(see for example |>7| and references therein). By and large though
those attempts have considered Shannon entropy as a physical quantity occurring
within the system under study, and then tried to relate that physical quantity to 
other aspects of the system. In contrast, where Shannon entropy has proven so 
successful in statistical physics, statistics, signal processing, etc., is in guiding 
the external scientist in his inference about the system under study. It is in this 
latter sense that Shannon entropy is used in PGT. 

The results in |1 II 1251 |2~T] can be viewed as the first derivation of bounded 
rational equilibria using full probabilistic reasoning. (The arguments in [2*5] 
concerned equilibrium concepts rather than distributions over the space of all 
possible mixed strategies.) It should be noted though that the maxent La- 
grangian has a history far predating both the work in ^] 1251 |2"T] and that in 
|2~5) . As the free energy of the CE it has been explored in statistical physics for 
well over a century. Indeed, the QRE is essentially identical to the "mean field 
approximation" of statistical physics. (See also \6'2\.) 

In the context of game theory, the maxent Lagrangian was given an ad hoc 
justification and investigated in |27U28lR)5| and related work. The first attempt 
to derive it in that context using first principles reasoning occurred in [26].

The use of the Boltzmann distribution mixed strategies also has a long his- 
tory in the Reinforcement Learning (RL) community, i.e., for the design of com- 
puter algorithms for a player involved in an iterated game with Nature |69l I7U| . 
Related work has considered multiple computational players [7J [72] ■ In partic- 
ular, some of that work has been done in the context of "mechanism design" of 
many computational players, i.e., in the context of designing the utility func- 
tions of the players to induce them to maximize social welfare |73l 1491 1531 152| . 
In all of this RL work the Boltzmann distribution is usually motivated either as 
an a priori reasonable way to trade off exploration and exploitation, as part of a
Markov Chain Monte Carlo procedure, or by its asymptotic convergence prop- 
erties |73j. 

The work in [75, 76, 77] in particular, and econophysics in general, also concerns
the relation between statistical physics and the social sciences. In particular,
much of that work considers the relation between equilibrium distributions
of statistical physics and notions of equilibrium in social science settings. None 
of it concerns game theory though. To relate that domain to statistical physics 
one must drill deeper into statistical physics, into its information-theoretic foun- 
dations as elaborated by Jaynes. The first relatively simple-minded work relat-
ing information theory, statistical physics, and bounded rational game theory 
this way was |4(j) . 

9 Appendix 2 - Using the Entropic Prior to Derive the CE

This appendix elaborates — in a very detailed manner — the application of 
the entropic prior to statistical physics that results in the canonical ensemble 
(CE). The level of detail presented borders on overkill. However it turns out 
not to be trivial how to set the analogous details arising in the application of 
the entropic prior to game theory. In addition, the subtleties of how to use the 
entropic prior to derive the CE are invariably slighted in the literature. Hence
first working through the well-understood statistical physics case can help hone 
intuition. 

On the other hand though, it turns out that in the CE, the temperature T 
— our prior knowledge concerning the system — equals the Lagrange parameter 
of a constrained optimization problem, rather than the value of the constraint 
associated with that Lagrange parameter. This is not the case with PGT, and 
introduces some subtleties that are mostly absent from PGT. In addition, the 
PGT analogue of what in the CE is the system's Hamiltonian function are the 
players' utility functions. While there is a single Hamiltonian in the CE, there 
are multiple utility functions in noncooperative games. Associated with this, in 
PGT there are multiple analogues of what in the CE is (global) temperature. 
All of this introduces complications into PGT that are absent from the CE. Due 
to all this, readers already comfortable with the entropic prior and how to apply 
it in the CE may want to skip this appendix. 

Write the precise microscopic state of a physical system under consideration 
as y. So for example, in classical (non-quantum) statistical physics, y is the set 
of positions and momenta of all the particles in the system. Arguments from 
physics are typically invoked to justify a claim that the temperature T of the 
system "determines the expected energy of the system" for the (known) energy 
function of the system, H(y). 39 

Note that for this conclusion of those arguments to be a falsifiable statement, 
expectation values ("expected energy") must be meaningful. So T must be 
associated with a (falsifiable) physical distribution over multiple y's, q(y), rather 
than with a single y. Typically this distribution arises by not fully specifying the 
starting y and by allowing unknown stochastic external influences to perturb the 
system between when we acquire the value T and any subsequent observation of 
a property of the system. (It is implicitly required that those external influences 
do not change the value that a repeated temperature measurement would give.) 
All that is fixed (in addition to T) are some high-level aspects of how the system 
is set up, and of how it is opened to the external world. (For example, it 

39 Properly speaking, H is the system's "Hamiltonian". 
may be that the identity of the person performing the experiment, how they
physically hold the experimental instruments, etc., fixes those physical details.) 
Accordingly, the state y at that subsequent observation can vary. 

So physically, to falsify a prediction of what q is associated with a particular 
T, we can imagine repeatedly setting up our system in the way specified and 
measuring the temperature, then opening it to (unknown) external influences in 
the way specified, and after that recording the resultant state y; the associated 
distribution across y's is the (falsifiable) q. What we are interested in is the
relation between that q and the measured T. 

The aforementioned "arguments from physics" tell us that for any specification
of how the system is set up and then opened, there is the same single-valued
function of the measured T to the expected energy of the system under q. So 
via that single-valued function, a particular value of T picks out a unique set of 
q's (namely those with the associated expected energy). 

Formally then, what we know is that the value of the temperature T uniquely
fixes the expected value under q of a measurement of the system energy H(y),
independent of the details of how the system is set up and then opened to
external influences, i.e., T fixes $E_q(H) = \int dy\, H(y)\, q(y)$. In general, for the same
T, different choices of how the system is set up and then opened to external
influences will result in a different one of the possible q consistent with the
T-specified value of $E_q(H)$. However we don't know a priori how the specification
of the system's setup and opening chooses among the set of all q that are
consistent with a particular value of $E_q(H)$. So even if the precise value of
$E_q(H)$ were given to us, and even if how the system is set up and then opened
were also specified, for us, it is as though nothing is specified concerning which
of the q consistent with $E_q(H)$ has been picked out by what we know. (It is
in encapsulating this ignorance of the distribution across q's that the entropic
prior will arise.)
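
To make concrete the point that a fixed expected energy leaves q undetermined, here is a small sketch on a hypothetical three-state system. The energies in `H`, the value `h`, and the one-parameter family `q_family` are assumptions chosen for illustration. Every member of the family has the same $E_q(H)$ but a different entropy.

```python
import numpy as np

H = np.array([0.0, 1.0, 2.0])   # hypothetical energy levels H(y)
h = 1.0                         # common value of E_q(H)


def q_family(t):
    """One-parameter family q(t) = (t, 1 - 2t, t), valid for 0 <= t <= 0.5."""
    return np.array([t, 1.0 - 2.0 * t, t])


def entropy(q):
    """Shannon entropy S(q) in nats, with 0 log 0 taken as 0."""
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())


# All of these q's lie on the same hyperplane E_q(H) = h.
for t in (0.1, 0.25, 1.0 / 3.0, 0.5):
    q = q_family(t)
    print(f"t={t:.3f}  E_q(H)={q @ H:.3f}  S(q)={entropy(q):.3f}")
```

Knowing only h, nothing in this family is singled out; it is the prior over q's that breaks the tie.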

Moreover, while for a particular choice of how the system is set up and then 
opened up we can ascertain the expected energy of the system by repeated 
experiments, often we cannot do this directly from a single one of those experi- 
ments. This means in particular that while typically we can measure T in such 
a single experiment, often we do not know how that value T fixes the expected 
energy under q. 40 So observing T does not always allow us to write down the 
expected energy, only to know that it has been fixed. In such instances, having 
observed T, we do not know what set of q's that value of T has picked out, only 
that there is some such set. 

The invariant $\mathscr{I}$ for this situation fixes T, and therefore specifies that the
distribution q(y) must lie on a hyperplane of the form $E_q(H) = h$. But it does
not specify the value h. Nor does it specify anything concerning which q goes
with any particular h, i.e., it tells us nothing concerning the distribution of q's
across that hyperplane. Our inference problem is to circumvent this handicap:
$\mathscr{I}$ is the value T, together with the knowledge that it fixes $E_q(H)$, and we

40 Observationally, that expected energy is defined in terms of a set of multiple experiments
in addition to the current one. Mathematically, even if we know the Hamiltonian function, 
often we cannot evaluate its expected value for a particular T. 
must use this to say something concerning q, the quantity we wish to infer.
Formally, we wish to evaluate $P(q \mid \mathscr{I})$ for this choice of $\mathscr{I}$. (Note that this is
a distribution across distributions.)

Note that the distribution q concerns the physical world. So in particular, it
is experimentally falsifiable. In contrast, a distribution $P(q \mid \mathscr{I})$ reflects us (the
researcher), and our (in)ability to infer q from T and the specification of how the
system is set up and then opened. Although such a perspective is not required,
one can interpret $P(q \mid \mathscr{I})$ as a subjective "degree of belief" in the objective
(i.e., falsifiable) distribution q. Alternatively, one can view $\mathscr{I}$ as picking out
a set of physical instances of our system that are consistent with $\mathscr{I}$, and then
interpret $P(q \mid \mathscr{I})$ in terms of frequencies of those instances.

For the reasons elucidated above, it makes sense to use an entropic prior
over q's for this $\mathscr{I}$. With such a prior, the MAP q is the one that maximizes
S(q) subject to the constraint $E_q(H) = h$. We just happen not to know h.

This is a constrained optimization problem with unknown constraint value.
The associated Lagrangian is

$$\mathscr{L}(\beta, q) \triangleq \beta\,[E_q(H) - h] - S(q).$$ (71)

Removing the additive constant $\beta h$ and dividing by the constant $\beta$ gives
$E_q(H) - S(q)/\beta$. This is known in statistical physics as the free energy of the
system. $\beta$ is the Lagrange parameter of our constraint. To solve our constrained
optimization problem, q and $\beta$ are jointly set so that the partial derivatives of
$\mathscr{L}(\beta, q)$ are all zero. 41 The minimizer of the free energy, i.e., the MAP q, is
given by the Boltzmann distribution,

$$q(y) \propto \exp(-\beta H(y)).$$ (72)

41 Throughout this paper the terms in any Lagrangian that restrict distributions to the
unit simplices are implicit. The other constraint needed for a Euclidean vector to be a valid
probability distribution is that none of its components are negative. This will not need to
be explicitly enforced in the Lagrangian here, since this constraint is always obeyed for the q
optimizing $\mathscr{L}(\beta, q)$.

For macroscopically large systems, the posterior over q is in essence a delta
function about the MAP solution, so the Bayes-optimal solution for almost any
loss function is given by Eq. 72.

$\beta$ turns out to be the inverse of the temperature of the physical system
(measured in units where Boltzmann's constant equals 1). In other words,
the invariant of our problem is the value of the Lagrange parameter, not of
the associated constraint constant. (The precise relationship between $\beta$ and h
depends on the function H in general.)

This scenario and its solution q are exactly the CE discussed previously. It is
the simplest of all scenarios considered in statistical physics (hence its name).
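
The relationship between the Lagrange parameter and the constraint value can be checked numerically on a toy system. The sketch below is an illustration under stated assumptions, not part of the paper's derivation: the three energy levels in `H`, the target `h`, and the helper names `boltzmann`, `solve_beta`, and `entropy` are all hypothetical. It finds by bisection the β whose Boltzmann distribution has expected energy h, then compares its entropy with that of every other distribution on the same hyperplane $E_q(H) = h$.

```python
import numpy as np


def boltzmann(H, beta):
    """q(y) proportional to exp(-beta * H(y)) on a finite state space."""
    w = np.exp(-beta * (H - H.min()))   # constant shift cancels on normalizing
    return w / w.sum()


def expected_energy(H, beta):
    return float(boltzmann(H, beta) @ H)


def solve_beta(H, h, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection for beta with E_q(H) = h; E_q(H) is decreasing in beta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_energy(H, mid) > h:
            lo = mid        # energy too high, so beta must increase
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def entropy(q):
    """Shannon entropy S(q) in nats, with 0 log 0 taken as 0."""
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())


H = np.array([0.0, 1.0, 2.0])   # hypothetical energy levels
h = 0.8                         # hypothetical constraint value
beta = solve_beta(H, h)
q_map = boltzmann(H, beta)

# Every q on three states with E_q(H) = 0.8 has the form
# (0.2 + t, 0.8 - 2t, t) with 0 <= t <= 0.4.
competitors = [np.array([0.2 + t, 0.8 - 2.0 * t, t])
               for t in np.linspace(0.0, 0.4, 41)]
max_competitor_entropy = max(entropy(q) for q in competitors)

print("beta:", beta)
print("Boltzmann q:", q_map)
print("S(q_map):", entropy(q_map), " best competitor on grid:", max_competitor_entropy)
```

Changing `H` while holding `h` fixed changes the β returned, in line with the remark that the relationship between β and h depends on the function H.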